6.1 Introduction

Graphs are a natural way to represent objects and their relationships. As a typical non-Euclidean data structure, a graph provides a flexible way to model the interactions between individual units in our daily lives. For example, social media networks, citation graphs, biological networks, and recommendation systems can all be modeled as graph structures.

Graph structure plays an equally important role in the NLP area as the sequence form introduced in the previous chapters. As shown in Fig. 6.1, multiple granularities of text can be organized in graph forms:

  1. Word Level. There are a variety of syntactic/semantic relations between the words within a sentence or a document. For example, we can obtain the dependency relations between words by dependency parsing or the coreference relations between words by coreference resolution. These syntactic/semantic relationships can effectively help us understand the compositional semantics of different constituents in the sentence and document. We can naturally regard words as nodes and their syntactic/semantic relations as edges, representing a sentence or a document as a relational graph.

    Fig. 6.1 An illustrative example of text-based graphs in different levels. The documents used in the figure are obtained from Wikipedia’s official website

  2. Document Level. In the real world, documents usually connect with each other. For example, in Wikipedia articles, the entity mentions within an article may contain human-annotated hyperlinks to the articles describing these entities, and a scientific paper may refer to other scientific papers as its related work. The connected documents may provide important background knowledge to comprehend the meaning of the document. We can represent the interactions between documents as a document graph by regarding documents as nodes and hyperlinks as edges.

  3. Knowledge Level. Human knowledge, such as world, linguistic, commonsense, and domain knowledge, is essential for understanding languages. In practice, most of this knowledge can be organized in a graph form. For example, the world knowledge graph in Chap. 9 regards entities as nodes and their relationships as edges. Representing human knowledge as graphs enables further complex reasoning over documents and knowledge.

It is critical to model the structural information in graph data, which can help models better understand, categorize, and reason over text in NLP tasks. Graph representation learning aims to learn low-dimensional representations of nodes or the entire graph. The geometric relationships in the low-dimensional semantic space should effectively reflect the structural information of the original graph, such as the global topological structure or the local graph connectivity.

In this chapter, we discuss how to properly represent graph data and their characteristics, starting from symbolic representations (Sect. 6.2). Then, we move to distributed representations, including the local representations of nodes (Sects. 6.3 and 6.4) as well as the global representation of the whole graph (Sect. 6.5). We also introduce recent advances in self-supervised learning on graphs (Sect. 6.6). Finally, we present how text is processed in graph form in downstream NLP tasks (Sect. 6.7), mainly at the word, sentence, and document levels. We leave a detailed introduction of knowledge-level applications to Chap. 9.

6.2 Symbolic Graph Representation

A common practice is to denote the graph with its node set \(\mathcal {V}\) and the edge set \(\mathcal {E}\). We can naturally represent a graph as \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\), where \(e=(v_i,v_j)\in \mathcal {E}\) is a directed edge from node vi to vj or an undirected one between vi and vj. When processing graph data in a computer, we usually represent the connections in a graph as an adjacency matrix.

For an undirected graph, as shown in Fig. 6.2, we construct an adjacency matrix \(\mathbf {A}\in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\). If there is any edge between node v and node u, i.e., \((v,u)\in \mathcal {E}\), we have the corresponding element Avu = Auv = 1; otherwise Avu = Auv = 0.

Fig. 6.2 The adjacency matrix representation for an undirected graph

For a directed graph, as shown in Fig. 6.3, Avu = 1 indicates there is an edge from node v to node u.

Fig. 6.3 The adjacency matrix representation for a directed graph

Moreover, for a weighted graph, we can store the weights of the edges instead of binary values in the adjacency matrix A, as shown in Fig. 6.4.

Fig. 6.4 The adjacency matrix representation for a weighted graph

In the era of statistical learning, such symbolic representations are widely used in graph-based NLP such as TextRank [67], where word and sentence graphs can be respectively built for keyword extraction [6] and extractive document summarization [66]. Although convenient and straightforward, the adjacency matrix representation suffers from a scalability problem. Firstly, the adjacency matrix A takes \(|\mathcal {V}|\times |\mathcal {V}|\) storage space, which is usually unacceptable when \(|\mathcal {V}|\) grows large. Secondly, the adjacency matrix is usually sparse, which means most of its entries are zeros. The data sparsity makes discrete algorithms applicable, but it is still hard to develop efficient algorithms for statistical learning [81].
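To make the symbolic representation concrete, the following Python sketch (using NumPy and SciPy, which are assumptions of this illustration rather than tools used in the referenced works) builds the adjacency matrix of a toy undirected graph and also stores it in a sparse format to mitigate the storage cost discussed above; the node names and edge list are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A toy undirected graph over nodes A-E (hypothetical example).
nodes = ["A", "B", "C", "D", "E"]
idx = {name: i for i, name in enumerate(nodes)}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

# Dense adjacency matrix: |V| x |V| entries, symmetric for an undirected graph.
A = np.zeros((len(nodes), len(nodes)), dtype=np.int8)
for u, v in edges:
    A[idx[u], idx[v]] = 1
    A[idx[v], idx[u]] = 1

# Sparse storage keeps only the nonzero entries, mitigating the O(|V|^2) cost.
A_sparse = csr_matrix(A)
print(A)
print(A_sparse.nnz, "nonzero entries out of", A.size)
```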

6.3 Shallow Node Representation Learning

To address the above issues, shallow node representation learning methods propose to map nodes to low-dimensional vectors. Formally, the goal is to learn a d-dimensional vector representation \(\mathbf {v}\in \mathbb {R}^d\) for every node \(v\in \mathcal {V}\) in the graph. Learned node representations are supposed to capture the graph’s structure information and then can be used as input features for downstream graph-related tasks. In this section, we will introduce several kinds of shallow node representation learning methods, including spectral clustering, shallow neural networks, and matrix factorization.

6.3.1 Spectral Clustering

Early methods of shallow node representation are usually based on spectral clustering, which typically computes the first k eigenvectors (or singular vectors) of an affinity matrix, such as the adjacency or Laplacian matrix of the graph. Now we present several algorithms based on spectral clustering, including locally linear embedding, Laplacian Eigenmap, directed graph embedding, and latent social dimensions.

Locally Linear Embedding (LLE)

LLE [88] assumes that the representations of a node and its neighbors lie in a locally linear patch of the manifold. In other words, a node’s representation can be approximated by a linear combination of the representations of its neighbors. LLE uses the linear combination of neighbors to reconstruct the center node. Formally, the reconstruction error of all nodes can be expressed as

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{W},\mathbf{V})=\sum_{v\in \mathcal{V}}\Vert \mathbf{v}-\sum_{u\in \mathcal{V}} {\mathbf{W}}_{vu}\mathbf{u} \Vert^2, {} \end{aligned} $$
(6.1)

where \(\mathbf {V} \in \mathbb {R}^{|\mathcal {V}|\times d}\) is the representation matrix containing all node representations v and Wvu is the learnable contribution coefficient of node u to v. LLE enforces Wvu = 0 if v and u are not connected, i.e., \((v,u)\not \in \mathcal {E}\). Further, the summation of a row of matrix W is set to 1, i.e., \(\sum _{u\in \mathcal {V}} {\mathbf {W}}_{vu}=1\).

Equation (6.1) is solved by alternately optimizing the weight matrix W and the representation matrix V. The optimization over W can be solved as a least-squares problem. The optimization over V leads to the following optimization problem:

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{W},\mathbf{V})=\sum_{v\in \mathcal{V}}\Vert\mathbf{v}-\sum_{u\in \mathcal{V}} {\mathbf{W}}_{vu}\mathbf{u}\Vert^2, \end{aligned} $$
(6.2)
$$\displaystyle \begin{aligned} s.t. \ \sum_{v\in \mathcal{V}} \mathbf{v}=\mathbf{0},\quad \quad |\mathcal{V}|{}^{-1} \sum_{v\in \mathcal{V}} {\mathbf{v}}^{\top}\mathbf{v}={\mathbf{I}}_d, \end{aligned} $$
(6.3)

where Id denotes the d × d identity matrix. The conditions in Eq. (6.3) ensure the uniqueness of the solution. The first condition centers all node representations at the origin, and the second condition guarantees that different coordinates have the same scale, i.e., equal contribution to the reconstruction error.

The optimization problem can be formulated as the computation of eigenvectors of matrix \(({\mathbf {I}}_{|\mathcal {V}|}-{\mathbf {W}}^{\top })({\mathbf {I}}_{|\mathcal {V}|}-\mathbf {W})\), which is an easily solvable eigenvalue problem. More details can be found in the note [20].
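Assuming the reconstruction weights W have already been fitted by least squares, a minimal NumPy sketch of the remaining LLE step computes node representations from the bottom eigenvectors of \(({\mathbf {I}}-\mathbf {W})^{\top }({\mathbf {I}}-\mathbf {W})\); the function name and interface are illustrative.

```python
import numpy as np

def lle_embedding(W: np.ndarray, d: int) -> np.ndarray:
    """Given row-stochastic reconstruction weights W (|V| x |V|), return
    d-dimensional node representations. W is assumed to be precomputed."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    # Skip the smallest (near-zero) eigenvalue, whose eigenvector is constant.
    return eigvecs[:, 1:d + 1]
```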

Laplacian Eigenmap

Laplacian Eigenmap [7] simply follows the idea that the representations of two connected nodes should be close. Specifically, the “closeness” is measured by the square of the Euclidean distance. We use \(\mathbf {D}\in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\) to denote the diagonal degree matrix, where Dvv is the degree of node v. By defining the Laplacian matrix L as the difference between D and the adjacency matrix A, we have L = D −A. The Laplacian Eigenmap algorithm minimizes the following objective:

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{V})=\sum_{\{v,u|(v,u)\in \mathcal{E}\}}\Vert\mathbf{v}-\mathbf{u}\Vert^2, {} \end{aligned} $$
(6.4)
$$\displaystyle \begin{aligned} s.t.\ {\mathbf{V}}^{\top}\mathbf{D}\mathbf{V}={\mathbf{I}}_d. {} \end{aligned} $$
(6.5)

The cost function is the summation of the square loss of all connected node pairs, and the condition prevents the trivial all-zero solution caused by arbitrary scale. Equation (6.4) can be reformulated in matrix form as

$$\displaystyle \begin{aligned} {\mathbf{V}}^*=\underset{{\mathbf{V}}^{\top}\mathbf{DV}={\mathbf{I}}_d}{\arg\min} \text{tr}({\mathbf{V}}^{\top}\mathbf{L}\mathbf{V}), {} \end{aligned} $$
(6.6)

where tr(⋅) is the matrix trace function. The optimal solution V∗ of Eq. (6.6) consists of the eigenvectors corresponding to the d smallest nonzero eigenvalues of L. Note that the Laplacian Eigenmap algorithm can be easily generalized to weighted graphs.
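A minimal NumPy/SciPy sketch of Laplacian Eigenmap under the above formulation is given below; it assumes a symmetric adjacency matrix without isolated nodes and solves the generalized eigenproblem Lv = λDv, keeping the eigenvectors of the d smallest nonzero eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(A: np.ndarray, d: int) -> np.ndarray:
    """A: symmetric (possibly weighted) adjacency matrix without isolated nodes.
    Returns |V| x d node representations satisfying V^T D V = I (Eq. 6.5)."""
    D = np.diag(A.sum(axis=1))
    L = D - A                      # graph Laplacian L = D - A
    # Generalized eigenproblem L v = lambda D v; eigenvalues in ascending order,
    # eigenvectors returned D-orthonormal by SciPy.
    eigvals, eigvecs = eigh(L, D)
    # Discard the trivial eigenvector associated with eigenvalue 0.
    return eigvecs[:, 1:d + 1]
```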

A significant limitation of both LLE and Laplacian Eigenmap is that their cost functions are symmetric, which prevents them from being applied to directed graphs.

Directed Graph Embedding (DGE)

DGE [17] generalizes Laplacian Eigenmap for both directed and undirected graphs based on a predefined transition matrix. For example, we can define a transition probability matrix \(\mathbf {P}\in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\) where Pvu denotes the probability that node v walks to u. The transition matrix defines a Markov random walk through the graph. We denote the stationary value of node v as πv where ∑vπv = 1. The stationary distribution of random walks is commonly used in many ranking algorithms such as PageRank [74]. DGE designs a new cost function that emphasizes those important nodes with higher stationary values:

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{V})=\sum_{v\in \mathcal{V}}\pi_v \sum_{u\in \mathcal{V}}{\mathbf{P}}_{vu}\Vert \mathbf{v}-\mathbf{u}\Vert^2. {} \end{aligned} $$
(6.7)

By denoting \(\mathbf {M} = \text{diag}(\pi _1,\pi _2,\dots ,\pi _{|\mathcal {V}|})\), the cost function Eq. (6.7) can be reformulated as

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{V})=2\text{tr}({\mathbf{V}}^{\top}\mathbf{B}\mathbf{V}), {} \end{aligned} $$
(6.8)
$$\displaystyle \begin{aligned} s.t. \ {\mathbf{V}}^{\top}\mathbf{M} \mathbf{V}={\mathbf{I}}_d, {} \end{aligned} $$
(6.9)

where

$$\displaystyle \begin{aligned} \mathbf{B}=\mathbf{M}-\frac{\mathbf{M} \mathbf{P}-{\mathbf{P}}^{\top}\mathbf{M}}{2}. {} \end{aligned} $$
(6.10)

The condition Eq. (6.9) is added to remove an arbitrary scaling factor of solutions. Similar to Laplacian Eigenmap, the optimization problem can also be solved as a generalized eigenvector problem.

Latent Social Dimensions

Latent social dimensions [103] introduce modularity [71] into the cost function, instead of minimizing the distance between node representations as in previous works. Modularity is a measurement that characterizes how far the graph is from a uniform random graph. Given \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), we assume that nodes \(\mathcal {V}\) are divided into n nonoverlapping communities. By “uniform random graph,” we mean that nodes connect to each other randomly given their degrees. Under this assumption, the expected number of edges between v and u is \(\operatorname {deg}(v)\frac {\operatorname {deg}(u)}{2|\mathcal {E}|}\). Then, the modularity Q of a graph is defined as

$$\displaystyle \begin{aligned} Q=\frac{1}{2|\mathcal{E}|}\sum_{v,u}\Big({\mathbf{A}}_{vu}-\frac{\operatorname{deg}(v)\operatorname{deg}(u)}{2|\mathcal{E}|}\Big)\delta(v,u), \end{aligned} $$
(6.11)

where δ(v, u) = 1 if v and u belong to the same community and δ(v, u) = 0 otherwise. Larger modularity indicates that the subgraphs inside communities are denser, which follows the intuition that a community is a dense well-connected cluster. Then, the problem is to find a partition that maximizes the modularity Q.

However, hard clustering based on modularity maximization is proven to be NP-hard. Therefore, the problem is relaxed to a soft case. Let \(\mathbf {d}\in \mathbb {Z}_+^{|\mathcal {V}|}\) denote the degrees of all nodes and \(\mathbf {S}\in \{0,1\}^{|\mathcal {V}|\times n}\) denote the community indicator matrix, where Svc = 1 indicates node v belongs to community c and Svc = 0 otherwise. Then, we define the modularity matrix B as

$$\displaystyle \begin{aligned} \mathbf{B}=\mathbf{A}-\frac{\mathbf{d}{\mathbf{d}}^{\top}}{2|\mathcal{E}|}, \end{aligned} $$
(6.12)

and modularity Q can be reformulated as

$$\displaystyle \begin{aligned} Q=\frac{1}{2|\mathcal{E}|}\text{tr}({\mathbf{S}}^{\top}\mathbf{B}\mathbf{S}). {} \end{aligned} $$
(6.13)

By relaxing S to a continuous matrix, it has been proved that the optimal solution of S is the top-n eigenvectors of modularity matrix B [70]. Then, the eigenvectors can be used as node representations.
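The following sketch (hypothetical interface, NumPy assumed) constructs the modularity matrix B and takes its top eigenvectors as soft social dimensions, mirroring the relaxed solution described above.

```python
import numpy as np

def social_dimensions(A: np.ndarray, n_dims: int) -> np.ndarray:
    """Top eigenvectors of the modularity matrix B = A - d d^T / (2|E|),
    used as node representations (soft community indicators)."""
    deg = A.sum(axis=1)
    two_m = deg.sum()                      # 2|E| for an undirected graph
    B = A - np.outer(deg, deg) / two_m
    eigvals, eigvecs = np.linalg.eigh(B)   # eigenvalues in ascending order
    return eigvecs[:, -n_dims:]            # eigenvectors of the largest eigenvalues
```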

To conclude, spectral clustering-based methods often define a cost function that is linear or quadratic in the node representations. The problem can then be reformulated in matrix form and solved by computing the eigenvectors of the matrix. However, the computation of eigenvectors for large-scale matrices is both time- and space-consuming, which limits these methods from being applied in real-world scenarios.

6.3.2 Shallow Neural Networks

With the success of word2vec [68], many works resort to shallow neural networks for node representation learning. Typically, each node is assigned a vector of trainable parameters as its representation, and the parameters are trained by optimizing a certain objective via gradient descent.

DeepWalk

DeepWalk [81] proposes a novel approach that introduces neural network techniques into graph representation learning for the first time. Compared with the aforementioned methods based on eigenvector computation, DeepWalk provides a faster way to learn low-dimensional node representations. The basic idea of DeepWalk is to adapt the well-known word representation learning algorithm word2vec [68] by regarding nodes as words and random walks as sentences.

Formally, given a graph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), DeepWalk uses node v to predict its neighboring nodes \(u_{-w},\dots ,u_{-1},u_1,\dots ,u_w\) in short random walk sequences, where w is the window size of skip-gram [68]. This can be formulated as

$$\displaystyle \begin{aligned} \min \sum_{j=-w,j\neq 0}^w-\log P(u_{j}|v), \end{aligned} $$
(6.14)

where the prediction probabilities of different nodes are assumed to be independent, and the overall loss sums the losses of all nodes in every random walk.

DeepWalk assigns each node v with two representations: node representation \(\mathbf {v}\in \mathbb {R}^d\) and context representation \({\mathbf {v}}^c\in \mathbb {R}^d\). Then, each probability P(u|v) is formalized as a Softmax function over all nodes:

$$\displaystyle \begin{aligned} P(u|v)=\frac{\exp({\mathbf{v}}^{\top}{\mathbf{u}}^c)}{\sum_{u'\in \mathcal{V}}\exp({\mathbf{v}}^{\top}\mathbf{u'}^c)}. {} \end{aligned} $$
(6.15)
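A rough sketch of the DeepWalk pipeline, random-walk generation followed by skip-gram training, is shown below. It assumes the third-party libraries networkx and gensim are available; the hyperparameters and the karate-club toy graph are illustrative choices, not the paper’s settings.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph: nx.Graph, num_walks: int = 10, walk_length: int = 40):
    """Generate truncated random walks, treating each walk as a 'sentence'."""
    walks = []
    nodes = list(graph.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk, cur = [str(start)], start
            for _ in range(walk_length - 1):
                neighbors = list(graph.neighbors(cur))
                if not neighbors:
                    break
                cur = random.choice(neighbors)
                walk.append(str(cur))
            walks.append(walk)
    return walks

# Skip-gram (sg=1) over the walks; window corresponds to w in Eq. (6.14).
graph = nx.karate_club_graph()
model = Word2Vec(random_walks(graph), vector_size=64, window=5, sg=1, min_count=0)
node_vec = model.wv["0"]   # learned embedding of node 0
```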

LINE

LINE [102] proposes a graph representation model which can handle large-scale graphs with arbitrary types: (un)directed or weighted. To characterize the interaction between nodes, LINE models the first-order proximity and second-order proximity of the graph.

Before introducing the details of the algorithm, we can take a step back and see the idea behind it. The modeling of first-order proximity, i.e., observed links, is the modeling of the adjacency matrix. Since the adjacency matrix is usually too sparse, the modeling of second-order proximity, i.e., nodes with shared neighbors, can serve as complementary information to enrich the adjacency matrix and make it denser.

Formally, first-order proximity between node v and u is defined as the edge weight Avu in the adjacency matrix. If nodes v and u are not connected, then the first-order proximity between them is 0.

Second-order proximity between node v and u is defined as the similarity between their neighbors. Let the row of node v in the adjacency matrix A(v, :) denote the first-order proximity between node v and other nodes. Then, the second-order proximity between v and u is defined as the similarity between A(v, :) and A(u, :). If they have no shared neighbors, the second-order proximity is 0.

To approximate first-order proximity \(\hat {P}_1(v,u)=\frac {{\mathbf {A}}_{vu}}{\sum _{(v',u')\in \mathcal {E}}{\mathbf {A}}_{v'u'}}\), the joint probability between v and u is defined as

$$\displaystyle \begin{aligned} P_1(v,u)=\frac{1}{1+\exp(-{\mathbf{v}}^{\top}\mathbf{u})}. \end{aligned} $$
(6.16)

To approximate second-order proximity \(\hat {P}_2(u|v)=\frac {{\mathbf {A}}_{vu}}{\sum _{u'}{\mathbf {A}}_{vu'}}\), the probability that node u appears in v’s context P2(u|v) is defined in the same form as Eq. (6.15). In this way, the second-order relationship between node representations is bridged by the context representations of their shared neighbors.

For parameter learning, the distances between \(\hat {P}_1(v,u)\) and P1(v, u) as well as between \(\hat {P}_2(u|v)\) and P2(u|v) are minimized. Specifically, LINE learns node representations for the first-order and second-order proximities individually and then concatenates them as the output embeddings. Although LINE can effectively capture both first-order and second-order local topological information, it cannot be easily extended to capture higher-order global topological information.
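The following toy SGD sketch optimizes LINE’s first-order objective with negative sampling; the uniform negative-sampling scheme and hyperparameters are simplifying assumptions rather than the exact procedure of the paper.

```python
import numpy as np

def line_first_order(edges, num_nodes, dim=32, neg=5, lr=0.025, epochs=10):
    """Minimal first-order LINE: pull connected nodes together, push random pairs apart."""
    rng = np.random.default_rng(0)
    V = rng.normal(scale=0.1, size=(num_nodes, dim))
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for v, u in edges:
            vv, vu = V[v].copy(), V[u].copy()
            g = 1.0 - sigmoid(vv @ vu)          # gradient of -log sigmoid(v.u)
            V[v] += lr * g * vu
            V[u] += lr * g * vv
            for un in rng.integers(0, num_nodes, size=neg):
                vn = V[un].copy()
                g = -sigmoid(V[v] @ vn)         # gradient of -log sigmoid(-v.u')
                V[v] += lr * g * vn
                V[un] += lr * g * V[v]
    return V
```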

Connection with Matrix Factorization

We prove that the DeepWalk algorithm with the skip-gram model is actually factorizing a matrix M, where each entry Mij is the logarithm of the average probability that node vi randomly walks to node vj within a fixed number of steps [127]. Intuitively, the matrix M is much denser than the adjacency or Laplacian matrices and therefore can help representations encode more structural information. In practice, the entry value Mij is estimated by random walk sampling. A more detailed proof can be found in our arXiv note [125]. Moreover, LINE is also proved to be equivalent to matrix factorization [83, 128], where the matrix is defined by the first- and second-order proximities.

To summarize, DeepWalk and LINE introduce shallow neural networks into graph representation learning. Thanks to the modeling ability of neural networks and the efficiency of shallow representations, these methods can outperform conventional graph representation learning methods such as spectral clustering-based algorithms, and are also efficient for large-scale graphs.

6.3.3 Matrix Factorization

Inspired by the connection between DeepWalk and matrix factorization, many research works explore how to learn better node representations by regarding the learning phase as a matrix factorization problem. Note that the eigenvector decomposition in spectral clustering methods can also be seen as a special case of matrix factorization. Here we present two typical matrix factorization-based graph representation learning algorithms, GraRep [12] and TADW [127], in this subsection.

GraRep

GraRep [12] directly follows the proof of the matrix factorization form of DeepWalk. According to our proof [125], DeepWalk is actually factorizing a matrix M where \(\mathbf {M}=\log \frac {\mathbf {A}+{\mathbf {A}}^2+\dots +{\mathbf {A}}^K}{K}\). From the matrix factorization form, DeepWalk considers the high-order topological information of the graph jointly. In contrast, GraRep proposes to treat the k-step information for each k separately in graph representation learning, and it can be divided into three steps:

  • Calculate the k-step transition probability matrix \({\mathbf {A}}^k\) for each k = 1, 2, …, K.

  • Obtain the k-step node representations for each k.

  • Concatenate all k-step node representations as the final node representations.

For the second step of obtaining the k-step node representations, GraRep directly uses a typical matrix decomposition technique, i.e., SVD, on \({\mathbf {A}}^k\). However, this algorithm is not very efficient, especially when k becomes large.
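A simplified NumPy sketch of the three GraRep steps is given below; it factorizes the raw k-step transition matrices by SVD and omits the log/shift transform used in the original algorithm.

```python
import numpy as np

def grarep(A: np.ndarray, K: int, d: int) -> np.ndarray:
    """Concatenate SVD-based representations of the k-step transition matrices."""
    P = A / A.sum(axis=1, keepdims=True)          # row-normalized transition matrix
    reps, Pk = [], np.eye(A.shape[0])
    for _ in range(K):
        Pk = Pk @ P                               # k-step transition probabilities
        # (GraRep additionally applies a log/shift transform; omitted in this sketch.)
        U, S, _ = np.linalg.svd(Pk)
        reps.append(U[:, :d] * np.sqrt(S[:d]))    # d-dimensional k-step representation
    return np.concatenate(reps, axis=1)           # final representation: K * d dims
```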

TADW

As the first attributed network embedding algorithm where node features are also available for learning node representations, text-associated DeepWalk (TADW) [127] further generalizes the matrix factorization framework to take advantage of text information. As shown in Fig. 6.5, the main idea of TADW is to factorize node affinity matrix \(\mathbf {M}\in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\) into the product of three matrices, \(\mathbf {W}\in \mathbb {R}^{d\times |\mathcal {V}|}\), \(\mathbf {H}\in \mathbb {R}^{d\times f_t}\), and text feature matrix \(\mathbf {T} \in \mathbb {R}^{f_t\times |\mathcal {V}|}\), where d and ft are the rank and feature dimensions, respectively. Then, TADW concatenates W and HT as 2d-dimensional node representations \(\mathbf {V}=\operatorname {concat}(\mathbf {W};\mathbf {HT})\).

Fig. 6.5 Illustration of text-associated DeepWalk (TADW)

Now the question is how to build the node affinity matrix M and how to extract the text feature matrix T from the text information. Following the proof of the matrix factorization form of DeepWalk, TADW sets the node affinity matrix M as a trade-off between efficiency and effectiveness: it factorizes the matrix \(\mathbf {M}=(\mathbf {A}+{\mathbf {A}}^2)/2\), where A is the row-normalized adjacency matrix. In this way, M can encode both first-order and second-order proximities. For the text feature matrix T, TADW first constructs the TF-IDF matrix from the text of nodes and then reduces the dimension of the TF-IDF matrix via SVD decomposition.

Formally, the model of TADW minimizes the following optimization function:

$$\displaystyle \begin{aligned} \min_{\mathbf{W},\mathbf{H}} ||\mathbf{M}-{\mathbf{W}}^{\top}\mathbf{H}\mathbf{T}||{}_F^2+\frac{\lambda}{2}(||\mathbf{W}||{}_F^2+||\mathbf{H}||{}_F^2), \end{aligned} $$
(6.17)

where λ is the regularization factor and ||⋅||F is the Frobenius norm. The parameters are optimized by updating W and H alternately via conjugate gradient descent.
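The sketch below factorizes M ≈ W⊤HT with plain gradient descent instead of the conjugate gradient method used in the paper; the hyperparameters are illustrative.

```python
import numpy as np

def tadw(M, T, d=80, lam=0.2, lr=1e-3, iters=200):
    """Gradient-based sketch of TADW: factorize M ~ W^T H T (Eq. 6.17)."""
    n, ft = M.shape[0], T.shape[0]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(d, n))
    H = rng.normal(scale=0.01, size=(d, ft))
    for _ in range(iters):
        E = M - W.T @ H @ T                        # reconstruction error
        grad_W = -2 * H @ T @ E.T + lam * W
        grad_H = -2 * W @ E @ T.T + lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    # 2d-dimensional node representations: concatenation of W and HT (per node).
    return np.concatenate([W.T, (H @ T).T], axis=1)
```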

To summarize, shallow graph representation learning methods have shown outstanding ability in modeling various kinds of graphs, such as large-scale [153] or heterogeneous [110] graphs. They are also used in many application scenarios such as recommendation systems [91, 129]. However, they still have several drawbacks [38]. Firstly, the model capacity of shallow methods is usually limited, which leads to suboptimal performance in complex scenarios. Secondly, the representations of different nodes share no parameters, which makes the number of parameters grow linearly with the number of nodes and leads to computational inefficiency. Thirdly, shallow graph representation methods are typically transductive, and thus cannot generalize to new nodes in a dynamic graph without retraining.

6.4 Deep Node Representation Learning

To address the problems of shallow node representations, researchers propose to utilize deep neural networks to aggregate information from graph structure. In this section, we introduce several typical solutions for node representation learning, including autoencoder-based methods, graph convolutional networks, graph attention networks, graph recurrent networks, graph Transformers, and several extensions based on them. Most deep node representation learning methods assume node features are available, and stack multiple neural network layers for representation learning. Here we denote the initial feature vector of node v as xv and the hidden representation of node v at the k-th layer as \({\mathbf {h}}_v^k\).

6.4.1 Autoencoder-Based Methods

Different from previous methods that use shallow neural networks to characterize the graph representations, structural deep network embedding (SDNE) [109] employs a deep autoencoder to model more complex relationships between node representations. The main body of SDNE is a deep autoencoder whose input and output vectors are the initial node feature xv and reconstructed feature \(\hat {\mathbf {x}}_v\), respectively. The algorithm takes the representations from the intermediate layer as node embeddings and encourages the embeddings of connected nodes to be similar.

Formally, a deep autoencoder first compresses the input node feature into a low-dimensional intermediate vector and then tries to reconstruct the original input from the low-dimensional intermediate vector. Hence, the intermediate hidden representation can capture the information of the input node since we can recover the input features from them. Assume the input vector is xv, then the hidden representation of each layer k is defined as

$$\displaystyle \begin{aligned} {\mathbf{h}}_v^{1}&=\operatorname{Sigmoid}({\mathbf{W}}^{1}{\mathbf{x}}_v+{\mathbf{b}}^{1}),\\ {\mathbf{h}}_v^{k}&=\operatorname{Sigmoid}({\mathbf{W}}^{k}{\mathbf{h}}_v^{k-1}+{\mathbf{b}}^{k}),\quad k=2,3,\dots ,K, \end{aligned} $$
(6.18)

where Wk and bk are the weight matrix and bias vector of the k-th layer. We assume that the hidden representation of the K-th layer has the minimum dimension, and the intermediate representation \({\mathbf {h}}_v^{K}\) can be seen as the low-dimensional representation of node v. Afterward, we can compute the output \(\hat {\mathbf {x}}_v\) by applying the reversed calculation process on \({\mathbf {h}}_v^{K}\). The optimization objective of the autoencoder is to minimize the difference between the input vector xv and the output vector \(\hat {\mathbf {x}}_v\):

$$\displaystyle \begin{aligned} \mathcal{L}_1=\sum_{v\in \mathcal{V}}\Vert\hat{\mathbf{x}}_v-{\mathbf{x}}_v\Vert^2. \end{aligned} $$
(6.19)

To encode the structure information, SDNE simply requires that the representations of connected nodes should be close to each other. Thus, the loss function is

$$\displaystyle \begin{aligned} \mathcal{L}_2=\sum_{(v,u)\in \mathcal{E}}\Vert{\mathbf{h}}_v^{K}-{\mathbf{h}}_u^{K}\Vert^2. \end{aligned} $$
(6.20)

Finally, the overall loss function is \(\mathcal {L}=\mathcal {L}_{1}+\alpha \mathcal {L}_{2}\), where α is a harmonic hyper-parameter. After the training process, \({\mathbf {h}}_v^{K}\) is taken as the representation of node v and used for downstream tasks.
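A minimal PyTorch sketch of the SDNE architecture and its combined loss is given below (PyTorch is an assumption of this illustration); the original model additionally reweights nonzero entries of the input in the reconstruction loss, which is omitted here.

```python
import torch
import torch.nn as nn

class SDNE(nn.Module):
    """A minimal SDNE-style deep autoencoder over node feature vectors."""
    def __init__(self, in_dim: int, hid_dim: int, emb_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid(),
                                     nn.Linear(hid_dim, emb_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(emb_dim, hid_dim), nn.Sigmoid(),
                                     nn.Linear(hid_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)        # h_v^K: low-dimensional node embeddings
        x_hat = self.decoder(h)    # reconstructed node features
        return h, x_hat

def sdne_loss(x, x_hat, h, src, dst, alpha=0.1):
    """L = L1 (reconstruction) + alpha * L2 (closeness of connected node pairs)."""
    l1 = ((x_hat - x) ** 2).sum()
    l2 = ((h[src] - h[dst]) ** 2).sum()   # src/dst: index tensors of edge endpoints
    return l1 + alpha * l2
```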

Experimental results show that SDNE can effectively reconstruct the input graph and achieve better results in several downstream tasks. However, the deep neural network part of SDNE, i.e., the deep autoencoder, is isolated from the graph structure during the feed-forward computation, which neglects the high-order information interaction among nodes.

6.4.2 Graph Convolutional Networks

Graph convolutional networks (GCNs) aim to generalize convolutional operation from CNNs [55] to the graph domain. The success of CNNs comes from its local connection and multilayer [54] architectures, which may benefit graph modeling as well: (1) graphs are also locally connected; (2) multilayer architectures can help capture the hierarchical patterns in the graph. However, CNNs can only operate on regular Euclidean data like text (1D sequence) and images (2D grid), and cannot be directly transferred to the graph structure. In this subsection, we introduce how GCNs extend the convolutional operation to deal with the non-Euclidean graph data.

Mainstream GCNs usually adopt semi-supervised settings for training, while previous graph embedding methods are mostly unsupervised or self-supervised. Here we only introduce the encoder architectures of GCNs and omit their loss functions, which depend on downstream tasks. Specifically, typical GCNs can be divided into spectral and spatial (nonspectral) approaches.

Spectral Methods

From the signal processing perspective, the convolutional operation first transforms a signal to the spectral domain, then modifies the signal with a filter, and finally projects the signal back to the original domain [63]. Spectral GCNs follow the same process and define the convolution operator in the spectral domain of graph signals.

Formally, d-dimensional input representations \(\mathbf {X}\in \mathbb {R}^{|\mathcal {V}|\times d}\) of a graph can be seen as d graph signals. Then, spectral GCNs are formulated as

$$\displaystyle \begin{aligned} \mathbf{H} = \mathcal{F}^{-1}(\mathcal{F}(\mathbf{g})\odot\mathcal{F}(\mathbf{X})), \end{aligned} $$
(6.21)

where H is the representations of all nodes, g is the filter in the spatial domain, \(\mathcal {F}(\cdot )\) and \(\mathcal {F}^{-1}(\cdot )\) indicate the graph Fourier transform (GFT) [9] and inverse GFT respectively, which can be defined as

$$\displaystyle \begin{aligned} \mathcal{F}(\mathbf{X}) = {\mathbf{U}}^{\top}\mathbf{X}, \mathcal{F}^{-1}(\mathbf{X}) = \mathbf{U}\mathbf{X}, \end{aligned} $$
(6.22)

where U is the eigenvector matrix of the normalized graph Laplacian \(\mathbf {L} = {\mathbf {I}}_{|\mathcal {V}|} - {\mathbf {D}}^{-\frac {1}{2}} \mathbf {A} {\mathbf {D}}^{-\frac {1}{2}}\), A is the adjacency matrix, and D is the degree matrix.

In practice, we can use a learnable diagonal matrix gθ to approximate the spectral graph filter \(\mathcal {F}(\mathbf {g})\), and the graph convolutional operation can be reformulated as

$$\displaystyle \begin{aligned} \mathbf{H} = \mathbf{U}{\mathbf{g}}_{\theta}{\mathbf{U}}^{\top}\mathbf{X}. \end{aligned} $$
(6.23)

Intuitively, initial graph signals X are transformed into spectral domain by multiplying U. Then filter gθ performs the convolution, and U projects graph signals back to their original space. The above form of graph convolution is used in spectral network [10], the first spectral GCN method.
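The following NumPy sketch implements Eq. (6.23) literally: it computes the normalized Laplacian, performs the GFT with its eigenvectors, applies a diagonal spectral filter, and transforms the signals back; g_theta is treated here as a given coefficient vector rather than a trained parameter.

```python
import numpy as np

def spectral_graph_conv(A: np.ndarray, X: np.ndarray, g_theta: np.ndarray) -> np.ndarray:
    """Spectral graph convolution H = U g_theta U^T X (Eq. 6.23).
    g_theta: length-|V| vector of filter coefficients (learnable in practice)."""
    deg = A.sum(axis=1)                                    # assumes no isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    eigvals, U = np.linalg.eigh(L)                         # graph Fourier basis
    X_hat = U.T @ X                    # GFT: project signals into the spectral domain
    return U @ (g_theta[:, None] * X_hat)   # filter, then inverse GFT
```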

However, the original form in Eq. (6.23) has several limitations: (1) the filter gθ is not directly related to the graph structure; (2) the kernel size of the graph filter grows with the number of nodes in the graph, which may cause inefficiency and overfitting issues; (3) the calculation of U relies on computationally inefficient matrix factorization. Now we introduce two typical spectral GCNs improving the original formats, including ChebNet [23] and GCN [52].

ChebNet

ChebNet [23] proposes to approximate \(\mathbf {U}{\mathbf {g}}_{\theta }{\mathbf {U}}^{\top }\) by Chebyshev polynomials Tk(⋅) up to the Kth order, which involve information within the K-hop neighborhood. In this way, ChebNet does not need to compute the matrix U, and the number of learnable parameters is related to K instead of \(|\mathcal {V}|\). Formally, the operation of ChebNet is reformulated as

$$\displaystyle \begin{aligned} \mathbf{H} = \sum_{k=0}^K {\theta}_k {\mathbf{T}}_k(\tilde{\mathbf{L}})\mathbf{X}, \end{aligned} $$
(6.24)

where \(\tilde {\mathbf {L}}={2}/{\lambda _{max}}\mathbf {L} - {\mathbf {I}}_{|\mathcal {V}|}\) is the normalized Laplacian matrix, λmax is the largest eigenvalue of L, and \(\theta \in \mathbb {R}^K\) is a weight vector indicating Chebyshev coefficients. The Chebyshev polynomials are defined as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{T}}_0(\tilde{\mathbf{L}}) & =&\displaystyle {\mathbf{I}}_{|\mathcal{V}|}, \\ {\mathbf{T}}_1(\tilde{\mathbf{L}}) & =&\displaystyle \tilde{\mathbf{L}}, \\ {\mathbf{T}}_k(\tilde{\mathbf{L}}) & =&\displaystyle 2\tilde{\mathbf{L}}{\mathbf{T}}_{k-1}(\tilde{\mathbf{L}}) - {\mathbf{T}}_{k-2}(\tilde{\mathbf{L}}). \end{array} \end{aligned} $$
(6.25)

The Kth-order polynomial can be efficiently computed in a recursive manner, and the parameter number of the graph filter is reduced to K.
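A NumPy sketch of the ChebNet filtering operation in Eq. (6.24), using the recurrence in Eq. (6.25), is shown below; theta is assumed to be a given coefficient vector.

```python
import numpy as np

def chebnet_layer(L: np.ndarray, X: np.ndarray, theta: np.ndarray, lam_max: float) -> np.ndarray:
    """ChebNet filtering H = sum_k theta_k T_k(L_tilde) X (Eq. 6.24)."""
    n = L.shape[0]
    L_tilde = 2.0 / lam_max * L - np.eye(n)      # rescale eigenvalues into [-1, 1]
    T_prev, T_curr = np.eye(n), L_tilde
    H = theta[0] * (T_prev @ X)
    if len(theta) > 1:
        H = H + theta[1] * (T_curr @ X)
    for k in range(2, len(theta)):
        T_next = 2 * L_tilde @ T_curr - T_prev   # Chebyshev recurrence (Eq. 6.25)
        H = H + theta[k] * (T_next @ X)
        T_prev, T_curr = T_curr, T_next
    return H
```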

GCN

GCN [52] is a first-order approximation of ChebNet. GCN reveals that ChebNet may suffer from the overfitting problem when handling graphs with very wide node degree distributions. Hence, GCN limits the maximum order of Chebyshev polynomials to K = 1, and the equation is simplified to the following form with two trainable scalars \(\theta _0^{\prime }\) and \(\theta _1^{\prime }\):

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbf{H} & =&\displaystyle \theta_0^{\prime} \mathbf{X} + \theta_1^{\prime} \left(\mathbf{L}-{\mathbf{I}}_{|\mathcal{V}|}\right)\mathbf{X} \\ & =&\displaystyle \theta_0^{\prime} \mathbf{X} - \theta_1^{\prime} {\mathbf{D}}^{-\frac{1}{2}}\mathbf{A}{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X}. \end{array} \end{aligned} $$
(6.26)

GCN further reduces the number of parameters to address overfitting by setting \(\theta _0^{\prime }=-\theta _1^{\prime }=\theta \), and the equation is reformulated as

$$\displaystyle \begin{aligned} \mathbf{H} = \theta \left({\mathbf{I}}_{|\mathcal{V}|} + {\mathbf{D}}^{-\frac{1}{2}}\mathbf{A}{\mathbf{D}}^{-\frac{1}{2}}\right) \mathbf{X}. \end{aligned} $$
(6.27)
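The sketch below implements one propagation step following Eq. (6.27), with the scalar θ generalized to a weight matrix and a ReLU nonlinearity added, as is common in practice; the renormalization trick used in practical GCN implementations is omitted.

```python
import numpy as np

def gcn_layer(A: np.ndarray, X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN propagation step: ReLU((I + D^{-1/2} A D^{-1/2}) X W)."""
    deg = A.sum(axis=1)                                     # assumes no isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_hat = np.eye(A.shape[0]) + D_inv_sqrt @ A @ D_inv_sqrt
    return np.maximum(A_hat @ X @ W, 0.0)                   # ReLU nonlinearity
```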

To summarize, spectral GCNs can effectively capture complex global patterns in graphs with spectral graph filters. However, the learned filters of the spectral methods usually depend on the Laplacian eigenbasis of the graph structure. This leads to the problem that a spectral-based GCN model learned on a specific graph cannot be directly transferred to another graph.

Spatial Methods

Different from spectral GCNs, spatial GCNs define graph convolutional operation by directly aggregating information on spatially close neighbors, which is also known as the message-passing process. As shown in Fig. 6.6, the representation \({\mathbf {h}}^{k}_{v}\) of node v at k-th layer can be seen as a function aggregating the representations of v and its neighbors at (k − 1)-th layer:

$$\displaystyle \begin{aligned} {\mathbf{h}}^{k}_{v}=f(\{{\mathbf{h}}^{k-1}_v\} \cup \{{\mathbf{h}}_u^{k-1}| \forall u \in \mathcal{N}_v\}), \end{aligned} $$
(6.28)

where \(\mathcal {N}_v\) is the neighbor set of node v.

Fig. 6.6 Illustration of spatial GCNs. In the feed-forward computation, each node aggregates information from the representations of its neighbors and itself

The key challenge of spatial GCNs is how to define a convolutional operation that can handle nodes with different degrees while maintaining the local invariance of CNNs. In this subsection, we introduce two widely used spatial GCNs, Neural FPs and GraphSAGE.

Neural FPs

Neural FPs [27] propose to use different weight matrices for nodes with different-sized neighborhoods:

$$\displaystyle \begin{aligned} {\mathbf{z}}_v^{k} &= {\mathbf{h}}_{v}^{k-1} + \sum_{u\in \mathcal{N}_v} {\mathbf{h}}_u^{k-1},\\ {\mathbf{h}}^{k}_{v} &= \sigma ( {\mathbf{W}}_{|\mathcal{N}_v|}^{k}{\mathbf{z}}_v^{k}), \end{aligned} $$
(6.29)

where \({\mathbf {W}}_{|{\mathcal {N}}_v|}^{k}\) is the weight matrix for nodes with degree \(|\mathcal {N}_v|\) at layer k and σ(⋅) is a nonlinear function such as Sigmoid. Neural FPs require learning weight matrices for all node degrees in the graph. Hence, when applied to large-scale graphs with diverse node degrees, the method cannot capture the invariant information among different node degrees and needs more parameters as node degrees get larger.

GraphSAGE

GraphSAGE [37] transfers GCNs to handle the inductive setting, where the representations of new nodes should be computed without retraining. Instead of utilizing the full set of neighbors, GraphSAGE learns graph representations by uniformly sampling a fixed-size neighbor set from each node’s local neighborhood. GraphSAGE can be formulated as

$$\displaystyle \begin{aligned} {\mathbf{h}}_{{\mathcal{N}}_v}^{k} &= {\operatorname{Aggregate}} (\{{\mathbf{h}}_u^{k-1}| \forall u\in {\mathcal{N}}_v\} ), \\ {\mathbf{h}}_{v}^{k} &= \sigma ({\mathbf{W}}^{k} \operatorname{concat}({\mathbf{h}}_{v}^{k-1} ; {\mathbf{h}}_{\mathcal{N}_v}^{k} ) ), \end{aligned} $$
(6.30)

where \(\mathcal {N}_v\) is the sampled neighbor set of node v and the aggregator function \(\operatorname {Aggregate}(\cdot )\) usually takes one of the following three forms:

  1. Mean aggregator. By utilizing a mean-pooling aggregator, GraphSAGE can be viewed as the inductive version of the original transductive GCN framework [52] (a minimal code sketch of this aggregator follows this list), which can be formulated as

    $$\displaystyle \begin{aligned} {\mathbf{h}}^{k}_v = \sigma\left({\mathbf{W}}^k\cdot{\operatorname{Mean}}\left(\{{\mathbf{h}}^{k-1}_v\} \cup \{{\mathbf{h}}_u^{k-1}| \forall u \in {{\mathcal{N}}_v}\}\right)\right). \end{aligned} $$
    (6.31)
  2. Max-pooling aggregator. The max-pooling aggregator first feeds each neighbor’s hidden representation into a fully connected layer and then applies a max-pooling operation to the obtained representations of the node’s neighbors. It can be formulated as

    $$\displaystyle \begin{aligned} {\mathbf{h}}_{{\mathcal{N}}_v}^{k} = \operatorname{Max}(\{\sigma({\mathbf{W}}^k{\mathbf{h}}^{k-1}_{u} + {\mathbf{b}}^k)| \forall u \in {\mathcal{N}}_v\}). \end{aligned} $$
    (6.32)
  3. LSTM aggregator. GraphSAGE also proposes to use an LSTM-based aggregator with a stronger expressive capability. Since LSTMs process inputs sequentially, GraphSAGE randomly permutes node v’s neighbors to adapt LSTMs to the unordered neighbor set.
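Below is a minimal NumPy sketch of one GraphSAGE layer with uniform neighbor sampling and the mean aggregator of Eq. (6.31); the tanh nonlinearity and the neighbor-list format are illustrative assumptions.

```python
import numpy as np

def graphsage_mean_layer(H_prev, neighbors, W, sample_size=10, seed=0):
    """One GraphSAGE layer with a mean aggregator over sampled neighbors.
    H_prev: |V| x d_in node representations; neighbors[v]: list of v's neighbor indices."""
    rng = np.random.default_rng(seed)
    H_new = np.zeros((H_prev.shape[0], W.shape[1]))
    for v in range(H_prev.shape[0]):
        nbrs = neighbors[v]
        if len(nbrs) > sample_size:                          # fixed-size neighbor sampling
            nbrs = rng.choice(nbrs, size=sample_size, replace=False)
        pooled = np.mean(H_prev[[v] + list(nbrs)], axis=0)   # mean over {v} and N_v
        H_new[v] = np.tanh(pooled @ W)                       # nonlinearity (assumed tanh)
    return H_new
```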

In summary, GCNs extend the idea of CNNs into the graph domain, enabling models to capture local connectivity of the graph, and have shown their superior abilities in a wide range of downstream tasks compared to the previous autoencoder-based methods. Even now, we can still get state-of-the-art performance by equipping vanilla GCNs with proper training strategies [48] or knowledge distillation methods [123, 124].

6.4.3 Graph Attention Networks

The attention mechanism has shown its strong ability to consider instance importance in learning representations in many NLP applications, such as machine translation [3, 33, 105] and machine reading [18]. Hence, many works [106, 146] focus on generalizing the attention mechanism to the graph domain and hope graph neural networks (GNNs) with attention-based operators can achieve better results by considering the importance of neighbors. Figure 6.7 illustrates the architecture of graph attention networks (GATs).

Fig. 6.7 Illustration of GATs. Compared with spatial GCNs, GATs assign different aggregation weights to different neighbors, and employ multiple parallel attention heads for computation

GAT

GAT [106] proposes to adopt the self-attention mechanism for the information aggregation of GNNs. Specifically, each node’s representation is calculated by attending to its neighbors:

$$\displaystyle \begin{aligned} {\mathbf{h}}^{k}_v = \sigma\left(\sum_{u\in {\mathcal{N}}_v\cup\{v\} } \alpha_{vu}^k {\mathbf{W}}^k{\mathbf{h}}_u^{k-1}\right), \end{aligned} $$
(6.33)

where σ(⋅) is a nonlinear function and \(\alpha _{vu}^k\) is the attention coefficient of node pair (v, u) at the k-th layer, which is normalized over node v’s neighbors:

$$\displaystyle \begin{aligned} \alpha_{vu}^k = \frac{\exp\left(\text{LeakyReLU}\left({\mathbf{a}}^{\top}\operatorname{concat}({\mathbf{W}}^k{\mathbf{h}}_v^{k-1};{\mathbf{W}}^k{\mathbf{h}}_u^{k-1})\right)\right)}{\sum_{m\in{\mathcal{N}}_v\cup\{v\}} \exp\left(\text{LeakyReLU}\left({\mathbf{a}}^{\top}\operatorname{concat}({\mathbf{W}}^k{\mathbf{h}}_v^{k-1};{\mathbf{W}}^k{\mathbf{h}}_m^{k-1})\right)\right)}, \end{aligned} $$
(6.34)

where Wk is the weight matrix of a shared linear transformation applied to every node, a is a learnable weight vector and LeakyReLU(⋅) is a nonlinear function.

Moreover, GAT utilizes the multi-head attention mechanism similar to [105] to further aggregate different types of information. Specifically, GAT concatenates (or averages) the output representations of M independent attention heads:

$$\displaystyle \begin{aligned} {\mathbf{h}}^{k}_v = \|{}_{m=1}^M \sigma\left(\sum_{u\in{\mathcal{N}}_v\cup\{v\}}\alpha_{vu}^{k,m}{\mathbf{W}}_m^k {\mathbf{h}}_u^{k-1}\right), \end{aligned} $$
(6.35)

where \(\alpha _{vu}^{k,m}\) is the attention coefficient from the m-th attention head at the k-th layer, \({\mathbf {W}}_m^k\) is the transform matrix for the m-th attention head, and ∥ is the concatenation operation.
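A single-head NumPy sketch of the GAT layer in Eqs. (6.33) and (6.34) is given below; the tanh output nonlinearity and the LeakyReLU slope of 0.2 are assumptions of this illustration.

```python
import numpy as np

def gat_layer(A, H_prev, W, a):
    """Single-head GAT layer. W: (d_in, d_out) transform; a: (2 * d_out,) attention vector."""
    n = A.shape[0]
    Z = H_prev @ W                                   # shared linear transformation
    H_new = np.zeros_like(Z)
    for v in range(n):
        nbrs = [u for u in range(n) if A[v, u] > 0 or u == v]   # N_v and v itself
        # Unnormalized attention logits with LeakyReLU (slope 0.2 assumed).
        logits = np.array([np.concatenate([Z[v], Z[u]]) @ a for u in nbrs])
        logits = np.where(logits > 0, logits, 0.2 * logits)
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()                                    # softmax over neighbors
        H_new[v] = np.tanh(sum(alpha[i] * Z[u] for i, u in enumerate(nbrs)))
    return H_new
```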

To summarize, by incorporating the attention mechanism into the information aggregating phase of spatial GCNs, GATs can assign proper weights to different neighbors and offer better interpretability. The ensemble of multiple attention heads further increases the model capacity and brings performance gains over GCNs.

6.4.4 Graph Recurrent Networks

To capture the dependency between two distant nodes by a graph encoder, one has to stack many GNN layers so that the information can propagate from one to another. However, stacking too many GNN layers in the feed-forward computation will cause the over-smoothing issue, which makes node representations less discriminative and harms the performance. Inspired by the success of gated recurrent unit (GRU) [19] and LSTM [42] in modeling long-term dependency in NLP, graph recurrent networks (GRNs) propose to equip the information aggregation of GCNs with gate mechanisms. Similar to the usage in RNNs, the gate mechanisms allow the information to propagate farther without severe gradient vanishment or over-smoothing issues. In this way, GRNs can improve the model’s ability in capturing the long-range dependency across the graph. In this subsection, we introduce several variants of GRNs, including GGNN, Tree-LSTM, and Graph LSTM.

Gated Graph Neural Network (GGNN)

GGNN [57] introduces a GRU-like function to improve the information propagation of the vanilla GCN architecture. In each layer, GGNN updates the representations of nodes by combining the information of their neighbors and themselves with the update and reset gates. Specifically, the recurrence of each layer in GGNN is defined as

$$\displaystyle \begin{aligned} {\mathbf{a}}_v^{k} &=\sum_{u \in \mathcal{N}_v} {\mathbf{h}}_u^{k-1}+ \mathbf{b}, \notag \\ {\mathbf{z}}_v^{k} &= \operatorname{Sigmoid}\left({\mathbf{W}}_z{{\mathbf{a}}_v^{k}}+{\mathbf{U}}_z{\mathbf{h}}_v^{k-1}\right), \notag \\ {\mathbf{r}}_v^{k} &= \operatorname{Sigmoid}\left({\mathbf{W}}_r{\mathbf{a}}_v^{k}+{\mathbf{U}}_r{\mathbf{h}}_v^{k-1}\right), \\ \widetilde{\mathbf{h}}_v^{k} &= \tanh\left({\mathbf{W}}{{\mathbf{a}}_v^{k}}+{\mathbf{U}}({{\mathbf{r}}_v^{k}}\odot{{\mathbf{h}}_v^{k-1}})\right) \notag ,\\ {\mathbf{h}}_v^{k} &= (1-{{\mathbf{z}}_v^{k}}) \odot{{\mathbf{h}}_v^{k-1}}+{{\mathbf{z}}_v^{k}}\odot{\widetilde{{\mathbf{h}}}_v^{k}}, \notag \end{aligned} $$
(6.36)

where \({\mathbf {a}}_{v}^k\) represents node v’s neighborhood information, b is the bias vector, \(\widetilde {\mathbf {h}}_v^{k}\) is the candidate representation, z and r are the update and reset gates, and W and U are weight matrices.

Typical RNNs can also be seen as a special case of GGNN, where the graph is a chain structure. Hence, GRNs like GGNN have been widely used in language models. Note that tree structure is very popular in text data, such as the dependency parsing tree. Next, we introduce Tree-LSTM [101], which extends GRNs to model the tree structure.

Tree-LSTM

Tree-LSTM [101] uses an LSTM-based unit with input/output gates iv/ov and memory cell cv to update representation hv of each tree node v. There are two variants of Tree-LSTM: Child-sum Tree-LSTM and N-ary Tree-LSTM. Instead of using a unified forget gate like LSTM, Child-sum Tree-LSTM assigns a forget gate fvm for each child m of node v, which allows tree node v to adaptively gather information from its children. N-ary Tree-LSTM requires each node to have at most N children, and assigns different learnable parameters for each child. Child-sum Tree-LSTM is suitable for trees whose children are unordered, and thus can be used for modeling dependency trees. N-ary Tree-LSTM can characterize the diverse relational information for each node’s children, and thus is usually used to model constituency parsing tree structure. Here we only present the formulas of Child-sum Tree-LSTM:

$$\displaystyle \begin{aligned} \widetilde{\mathbf{h}}_v^{k-1} &= \sum_{m \in {\mathcal{N}}_v^c} {\mathbf{h}}_m^{k-1} \notag ,\\ {\mathbf{i}}_v^{k} &= \operatorname{Sigmoid} \Big( {\mathbf{W}}_i {\mathbf{x}}_v + {\mathbf{U}}_i \widetilde{\mathbf{h}}_v^{k-1} + {\mathbf{b}}_i \Big) \notag ,\\ {\mathbf{f}}_{vm}^k &= \operatorname{Sigmoid} \Big( {\mathbf{W}}_f {\mathbf{x}}_v + {\mathbf{U}}_f {\mathbf{h}}_m^{k-1} + {\mathbf{b}}_f \Big) \notag ,\\ {\mathbf{o}}_v^{k} &= \operatorname{Sigmoid} \Big( {\mathbf{W}}_o {\mathbf{x}}_v + {\mathbf{U}}_o \widetilde{{\mathbf{h}}}_v^{k-1} + {\mathbf{b}}_o \Big) ,\\ {\mathbf{u}}_v^k &= \tanh \Big({\mathbf{W}}_u {\mathbf{x}}_v + {\mathbf{U}}_u \widetilde{{\mathbf{h}}}_v^{k-1} + {\mathbf{b}}_u \Big) \notag ,\\ {\mathbf{c}}_v^k &= {\mathbf{i}}_v^k \odot {\mathbf{u}}_v^k + \sum_{m \in {\mathcal{N}}_v^c} {\mathbf{f}}_{vm}^k \odot {\mathbf{c}}_m^{k-1} \notag ,\\ {\mathbf{h}}_v^k &= {\mathbf{o}}_v^k \odot \tanh({\mathbf{c}}_v^k) \notag , \end{aligned} $$
(6.37)

where \({\mathcal {N}}_v^c\) is the children set of node v and xv is the input representation for tree node v. Readers can refer to the original paper [101] for the details of N-ary Tree-LSTM variant.

Graph LSTM

Graph LSTM [78, 144] proposes to adapt Tree-LSTM to model the graph structure, and utilizes different weight matrices to represent different labels on the edges. Formally, assume that the edge label between node v and its child m is l. Compared with Eq. (6.37) in Tree-LSTM, Graph LSTM uses label-specific weight matrix Ul to compute relevant gates and hidden states.

In summary, GRNs with gate mechanisms can effectively model the long-range dependencies between distant nodes, which is very important in text modeling. By adapting to tree-structure data, GRNs can also handle the diverse syntactic and semantic relations in text.

6.4.5 Graph Transformers

Transformer [105] has set off a craze in both the NLP and CV areas [26, 84]. Based on the powerful self-attention architecture, graph Transformer networks [53, 87, 138] have been proposed to improve the expressive ability of GNNs. The core idea is to leverage the Transformer architecture to capture long-range relationships between distant nodes. Compared with GRNs, graph Transformers can benefit from the advantages of Transformers over LSTMs.

Connections with Transformers in Text Modeling

In graph Transformers, nodes rather than words are the basic units, and the self-attention mechanism is performed between all node pairs. In this way, all nodes are directly connected regardless of the original graph structure. To utilize the original topology information, current graph Transformers focus on modifications of the input features and attention coefficients. Other operations, including multi-head ensembling, the feed-forward network, and layer normalization, remain unchanged. Now we introduce three typical graph Transformers, including Graphormer [138], GraphTrans [116], and SAT [15].

Graphormer

Graphormer [138] designs three structural encoding modules to inject graph structure information into the Transformer architecture. Specifically, centrality encoding adds node degrees to the input, which can indicate the importance of different nodes:

$$\displaystyle \begin{aligned} {\mathbf{h}}^{0}_v = {\mathbf{x}}_v + {\mathbf{z}}^-_{\text{deg}^-(v)} + {\mathbf{z}}^+_{\text{deg}^+(v)}, \end{aligned} $$
(6.38)

where xv is the feature vector of node v, and \({\mathbf{z}}^-\) and \({\mathbf{z}}^+\) are learnable embedding vectors indexed by node in-degree and out-degree, respectively.

Spatial encoding of node distances and edge encoding of edge features serve as bias terms for the attention coefficients in the self-attention layer:

$$\displaystyle \begin{aligned} \alpha_{vu}^k= ({\mathbf{W}}_Q{\mathbf{h}}_v^{k-1})^{\top}({\mathbf{W}}_K{\mathbf{h}}_u^{k-1})/\sqrt {d}+ {b}_{\phi(v,u)}+c_{vu}, \end{aligned} $$
(6.39)

where \(\alpha _{vu}^k\) is the attention coefficient between node v and u in the self-attention layer, WQ, WK are weight matrices, d is the hidden dimension, bϕ(v,u) is the learnable scalar indexed by the distance ϕ(v, u) between v and u, and cvu is the scalar derived from the edge features between v and u.
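The following sketch computes the Graphormer attention coefficients of Eq. (6.39) given precomputed distance and edge-bias tables; the table formats and names are hypothetical.

```python
import numpy as np

def graphormer_attention_scores(H, W_Q, W_K, dist, b_table, c_edge):
    """Attention coefficients of Eq. (6.39): scaled dot product plus a learnable
    distance bias b_table[dist] and an edge-feature bias c_edge (both assumed given)."""
    d = W_Q.shape[1]
    Q, K = H @ W_Q, H @ W_K
    scores = Q @ K.T / np.sqrt(d)          # (|V| x |V|) dot-product term
    scores = scores + b_table[dist]        # spatial encoding: bias indexed by distance
    scores = scores + c_edge               # edge encoding bias
    # Row-wise softmax to obtain the final attention weights.
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```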

To take advantage of deep models in utilizing structure information, GraphTrans [116] and SAT [15] propose to integrate GNN and Transformer architectures for modeling.

GraphTrans

GraphTrans [116] directly stacks a Transformer module on the top of a GNN module. In other words, the output node representations of the GNN are used as the input features of the Transformer. GraphTrans adopts a special [CLS] token which connects to all other nodes, and the representation of [CLS] token after the Transformer is taken as the graph representation. Hence, the Transformer can also be seen as a pooling operator for GNN. As a result, GraphTrans can capture both the local structured information and long-range relationships on graphs at the same time.

SAT

SAT [15] proves that simply modifying the position encoding module of standard Transformers cannot fully capture the structural similarity between nodes on a graph. To address this problem, SAT defines the attention coefficients in the Transformer by the similarity of GNN-based representations. In this way, the GNN serves as a structure extractor that integrates more information into the self-attention layers.

In summary, graph Transformer architecture can model long-range relationships on a graph and go beyond the limitations of traditional deep GNNs, such as over-smoothing. Compared with GNNs limited by the Weisfeiler-Lehman test [119], the Transformer-based methods become more expressive. Besides, GNNs can also be integrated into graph Transformers to better utilize the graph topology information.

6.4.6 Extensions

In this subsection, we will talk about several typical extensions of GNNs, including skip connection, neighborhood sampling, and the modeling of diverse graph types, which can respectively improve the effectiveness, efficiency, and generalizability of GNNs.

GNNs with Skip Connection

Theoretically, we can enhance the expressive ability of GNNs by stacking more layers. However, existing works find that deeper GNNs do not perform better in downstream tasks and even perform worse [52]. Chen et al. [14] further attribute this phenomenon to the low information-to-noise ratio received by the nodes in deep GNNs. The residual network [40], which has been verified in the computer vision community, is a straightforward solution to the problem. But researchers find that deep GNNs with residual connections still perform worse than two-layer GNNs. Therefore, Rahimi et al. [85] and Xu et al. [120] further explore how to enhance the performance of GNNs with skip connections. Inspired by the idea of the highway network [160], Rahimi et al. [85] employ a layer-wise gate mechanism, with which the performance peaks at four layers. Formally, the Highway GCN can be defined as

$$\displaystyle \begin{aligned} \mathbf{T}({\mathbf{h}}^{k-1}_v) &= \operatorname{Sigmoid} \left( {\mathbf{W}}^{k-1} {\mathbf{h}}^{k-1}_v + {\mathbf{b}}^{k-1} \right),\\ {\mathbf{h}}^{k}_v &= {\mathbf{h}}^{k}_v \odot \mathbf{T}({\mathbf{h}}^{k-1}_v) + {\mathbf{h}}^{k-1}_v \odot (1 - \mathbf{T}({\mathbf{h}}^{k-1}_v)). \end{aligned} $$
(6.40)

Besides, Xu et al. [120] present the jump knowledge network, which selects representations from all intermediate layers as final node representations. The selecting mechanism enables the jump knowledge network to adaptively pick the reasonable neighborhood information for each node.

GNNs with Neighborhood Sampling

The vanilla GNN [89] has several limitations: (1) the computation is based on the entire graph Laplacian matrix, and thus computationally expensive for large graphs; (2) it is trained and specialized for a given graph, and cannot be transferred to another graph. To address the problems, the aforementioned GraphSAGE [37] first samples neighborhood nodes of a target node, and then aggregates the representations of all sampled nodes. Thus, GraphSAGE can get rid of the graph Laplacian and can be applied to unseen nodes. Ying et al. [139] further propose the importance-based sampling method PinSage, which simulates random walks starting from target nodes and samples the neighborhood set according to the normalized visit counts. Instead of sampling neighbors for each node, Chen et al. [16] propose FastGCN, which directly samples the receptive field for each layer. In other words, only sampled nodes can participate in layer-wise information propagation. FastGCN sets the sampling importance of nodes according to their degrees, and tends to keep the nodes with larger degrees. Besides, Huang et al. [45] introduce a parameterized and trainable sampler, which performs layer-wise sampling based on the previous layer.

GNNs for Graphs of Diverse Types

The vanilla GNN [89] is designed for undirected graphs, the simplest graph format. However, as shown in Fig. 6.8, we have graphs of diverse types in real-world scenarios. Now we will introduce several extensions of GNNs to deal with directed, heterogeneous, or dynamic graphs:

Fig. 6.8 Diverse types of graphs

Directed Graphs

Directed edges contain extra information compared to undirected ones. For example, a famous person in a social network may be followed by lots of other users, and her influence on her followers is usually stronger than theirs on her. This suggests that GNNs should treat the two directions of information propagation differently. DGP [49] defines two graphs where nodes are respectively connected to all their ancestors or descendants. We correspondingly denote the normalized adjacency matrices as \({\mathbf {D}}^{-1}_p{\mathbf {A}}_p\) and \({\mathbf {D}}^{-1}_c{\mathbf {A}}_c\). The encoder of DGP can be formulated as

$$\displaystyle \begin{aligned} {\mathbf{H}}^{k} = \sigma({\mathbf{D}}^{-1}_p{\mathbf{A}}_p \sigma({\mathbf{D}}^{-1}_c{\mathbf{A}}_c{\mathbf{H}}^{k-1}{\mathbf{W}}_c){\mathbf{W}}_p), \end{aligned} $$
(6.41)

where Hk is the representation matrix of all nodes in the k-th layer, Wp and Wc are weight matrices, and σ(⋅) is a nonlinear function. In this way, the information propagations of two directions are processed separately.

Heterogeneous Graphs

A heterogeneous graph [92] has several kinds of nodes. For example, for the graph in the shopping recommendation system, we have user nodes, item nodes, etc. The simplest way to process heterogeneous graphs is to consider the node type information in their input features, i.e., convert the node type information into a one-hot feature vector and concatenate it to the original features. Besides the simple feature-based approach, GraphInception [150] introduces the concept of meta-path into the information propagation on heterogeneous graphs. GraphInception utilizes GNNs to perform information propagation on homogeneous subgraphs, which are extracted based on human-designed meta-paths. At last, it concatenates the outputs from different homogeneous GNNs as final node representations. Wang et al. [111] further propose the heterogeneous graph attention network (HAN) with node-level and meta-path-level attention mechanisms. In addition, some works [112, 154] also consider the modeling of network schema [99], which is a meta template of a heterogeneous graph indicating node types and their relations.

Besides, many graphs contain edges with weights or types as additional information. A typical way to handle such graphs is to build a bipartite graph, where the original edges are converted into nodes linking to their original endpoint nodes, and the edge type information is thus converted into node type information. Another way is to assign different propagation weight matrices to different edge types. However, if a graph has many edge types, the number of parameters will be large. To address the problem, R-GCN [90] introduces two regularization tricks that decompose the transformation matrix \({\mathbf{W}}_r\) of edge type r into a set of basis transformations shared among all edge types. Thus, R-GCN can reduce the number of parameters and capture the relationship between different edge types.
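The basis-decomposition trick can be sketched in a few lines of Python: every relation-specific transformation is a learned linear combination of a small set of shared basis matrices, so the number of parameters grows with the number of bases rather than the number of edge types. The shapes and variable names below are assumptions for illustration only.

import numpy as np

def rgcn_basis_weights(V, a):
    """Compose relation-specific weights W_r = sum_b a[r, b] * V[b] from shared bases.

    V: (num_bases, d_in, d_out) shared basis matrices.
    a: (num_relations, num_bases) relation-specific coefficients.
    Returns an array of shape (num_relations, d_in, d_out).
    """
    return np.einsum('rb,bio->rio', a, V)

# Example: 10 edge types sharing 3 bases of shape (16, 16).
rng = np.random.default_rng(0)
W = rgcn_basis_weights(rng.normal(size=(3, 16, 16)), rng.normal(size=(10, 3)))
# Propagation for edge type r would then use A_r @ H @ W[r].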

Dynamic Graphs

Graphs are usually dynamic and vary over time. For example, a user in a social network may start following another user. To model graph structures that change over time, DCRNN [56] and STGCN [143] first capture the static graph information at each time step by GNNs and then feed the output representations into a sequence model such as an RNN. In addition, Structural-RNN [47] and ST-GCN [122] extend the static graph structure with temporal connections and apply conventional GNNs on the extended graphs, which can capture structural and temporal information at the same time. MetaDyGNN [131] combines GNNs with meta-learning for few-shot link prediction in dynamic graphs.
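The snapshot-based idea behind models such as DCRNN and STGCN (a GNN per time step followed by a sequence model) can be illustrated with the following PyTorch sketch; it is a toy GCN-then-GRU pipeline under assumed input shapes, not the architecture of any specific paper.

import torch
import torch.nn as nn

class GCNThenGRU(nn.Module):
    """Toy snapshot-based dynamic graph model: one GCN layer per time step, then a GRU over time."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, adjs, feats):
        # adjs: list of (N, N) normalized adjacency matrices, one per time step
        # feats: list of (N, in_dim) node feature matrices, one per time step
        states = [torch.relu(a @ self.lin(x)) for a, x in zip(adjs, feats)]
        seq = torch.stack(states, dim=1)  # (N, T, hid_dim): one sequence per node
        out, _ = self.gru(seq)            # temporal modeling of each node's trajectory
        return out[:, -1]                 # final-step representation for each node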

6.5 From Node Representation to Graph Representation

In the previous sections, we have introduced how to represent nodes in a graph, from shallow node representations to deep node representations. In many scenarios, we also need to compute the representation of an entire graph or a specific subgraph. Inspired by the pooling operations in the NLP and CV areas, graph pooling is designed to obtain the graph representation from node representations. Here we present two typical groups of graph pooling methods, namely, flat pooling and hierarchical pooling.

6.5.1 Flat Pooling

Flat pooling assumes a flat graph structure and generates the graph representation directly from all node representations. It includes max/mean/sum pooling as simple node pooling methods and SortPooling [147], which considers a node ordering based on structural roles.

Simple Node Pooling

Similar to the pooling operation in NLP and CV, we can directly apply node-wise max/mean/sum operators on top of node representations for graph pooling. The graph representation can be formulated as

$$\displaystyle \begin{aligned} \left\{ \begin{aligned} \max_{v\in \mathcal{V}} \mathbf{v}, \quad \quad &\text{(Graph max-pooling)}\\ \frac{1}{|\mathcal{V}|}\sum_{v\in \mathcal{V}} \mathbf{v}, \quad \quad &\text{(Graph mean-pooling)}\\ \sum_{v\in \mathcal{V}} \mathbf{v}. \quad \quad &\text{(Graph sum-pooling)} \end{aligned} \right. \end{aligned} $$
(6.42)

The above pooling operators are general and parameter-free, but completely neglect the graph structure.
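The three operators in Eq. (6.42) can be written directly on top of a node representation matrix; the following Python sketch (with an assumed (num_nodes, dim) input) makes the parameter-free nature of flat pooling explicit.

import numpy as np

def flat_pool(H, mode="mean"):
    """Flat graph pooling (Eq. 6.42) over node representations H of shape (num_nodes, dim)."""
    if mode == "max":
        return H.max(axis=0)    # graph max-pooling
    if mode == "mean":
        return H.mean(axis=0)   # graph mean-pooling
    if mode == "sum":
        return H.sum(axis=0)    # graph sum-pooling
    raise ValueError(f"unknown pooling mode: {mode}")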

SortPooling

SortPooling [147] first sorts node representations by their structural roles, which enables a consistent node ordering across different graphs and makes it possible to train typical neural networks on the sorted node representations for pooling. In particular, SortPooling feeds the sorted representations into a 1-D CNN to obtain the graph representation, enabling the pooling operation to keep more information about the global graph topology.
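The sorting-and-truncation step of SortPooling can be sketched as follows; for simplicity, the sketch sorts nodes by the last feature channel (standing in for the structural-role ordering in the paper) and pads or truncates to a fixed size k so that a standard 1-D CNN can be applied afterward.

import numpy as np

def sort_pool(H, k):
    """Sort node representations and keep a fixed number of rows as the 1-D CNN input."""
    order = np.argsort(-H[:, -1])  # descending order on the last channel (proxy for structural role)
    Hs = H[order][:k]
    if Hs.shape[0] < k:            # pad small graphs with all-zero rows
        Hs = np.vstack([Hs, np.zeros((k - Hs.shape[0], H.shape[1]))])
    return Hs                      # fixed-size (k, dim) tensor for the downstream 1-D CNN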

Though simple and effective, flat pooling ignores the hierarchical structure of a graph (e.g., nodes, subgraphs, and communities), thus leading to suboptimal graph representations.

6.5.2 Hierarchical Pooling

The basic idea of hierarchical pooling is to recursively group structurally related nodes into clusters to form coarsened graphs and to obtain the graph representation layer by layer. Next, we will introduce two typical hierarchical pooling operations.

DiffPool

DiffPool [140] proposes to learn hierarchical representations on top of node representations and can be combined with various node representation learning methods in an end-to-end fashion. As shown in Fig. 6.9, DiffPool learns a differentiable soft cluster assignment for each node and then maps nodes to a set of clusters layer by layer.

Fig. 6.9 The illustration of DiffPool: the original graph is coarsened into pooled graphs level by level. The figure is redrawn from Figure 1 of [140]

Formally, let \({\mathbf{S}}^{k} \in \mathbb{R}^{C_k\times C_{k+1}}\) denote the learned cluster assignment matrix at the k-th layer, where \({\mathbf{S}}^{k}_{vc}\) indicates the degree to which node v belongs to cluster c at the k-th layer, and \(C_k\) is the number of clusters at the k-th layer. With the cluster assignment matrix \({\mathbf{S}}^{k}\), we can then calculate the adjacency matrix \({\mathbf{A}}^{k+1}\in \mathbb{R}^{C_{k+1}\times C_{k+1}}\) of the next layer from the connectivity strength between the learned clusters:

$$\displaystyle \begin{aligned} {\mathbf{A}}^{k+1} = {{\mathbf{S}}^{k}}^{\top} {\mathbf{A}}^{k}{\mathbf{S}}^{k}. \end{aligned} $$
(6.43)

Then the output node representations \({\mathbf{H}}^{k+1}\) are computed by a GNN encoder:

$$\displaystyle \begin{aligned} {\mathbf{H}}^{k+1} = \operatorname{GNN}({\mathbf{A}}^{k+1},{\mathbf{X}}^{k+1}), \end{aligned} $$
(6.44)

where the input node representations \({\mathbf{X}}^{k+1}\) are obtained by aggregating the output representations from the k-th layer:

$$\displaystyle \begin{aligned} {\mathbf{X}}^{k+1} = {{\mathbf{S}}^{k}}^{\top} {\mathbf{H}}^{k}. \end{aligned} $$
(6.45)

DiffPool predefines the number of clusters at each layer and applies another GNN to the coarsened adjacency matrix \({\mathbf{A}}^{k}\) to generate the soft cluster assignment matrix \({\mathbf{S}}^{k}\). Finally, DiffPool feeds the top-layer graph representation into a downstream task classifier, so that the cluster assignment matrices are supervised by the downstream task.
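Equations (6.43)-(6.45) can be combined into a single coarsening step; the Python sketch below uses one-layer GCN-style propagation in place of the GNNs in the original paper, and the weight matrices W_assign and W_embed are illustrative parameters.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diffpool_step(A, H, W_assign, W_embed):
    """One DiffPool coarsening step (Eqs. 6.43-6.45) with single-layer GNNs for simplicity."""
    S = softmax(A @ H @ W_assign, axis=1)        # soft cluster assignment, shape (C_k, C_{k+1})
    X_next = S.T @ H                             # Eq. (6.45): pooled input features
    A_next = S.T @ A @ S                         # Eq. (6.43): coarsened adjacency matrix
    H_next = np.tanh(A_next @ X_next @ W_embed)  # Eq. (6.44): GNN on the coarsened graph
    return A_next, H_next, S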

gPool

As shown in Fig. 6.10, gPool [32] presents both graph pooling (gPool) and unpooling (gUnpool) operations, based on which the graph data is modeled by an encoder-decoder architecture. The encoder and decoder contain the same number of blocks, and each block consists of a GCN layer and a gPool or gUnpool operator. The representations after the last decoder block are used as the final representations for downstream tasks.

Fig. 6.10 Illustration of gPool and gUnpool. The figure is redrawn from Figure 2 of [32]

The gPool operation learns projection scores for each node with a learnable projection vector, and selects nodes with the highest scores as important ones to feed to the next layer:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{s}}^k & = {\mathbf{X}}^{k}{\mathbf{p}}^k/||{\mathbf{p}}^k||, \\ idx^k & = \text{rank}({\mathbf{s}}^k, n^k), \end{array} \end{aligned} $$
(6.46)

where \({\mathbf{s}}^k\) is the importance score vector; \({\mathbf{p}}^k\) and \({\mathbf{X}}^{k}\) are the projection vector and the input feature matrix of the k-th layer, respectively; and \(\text{rank}({\mathbf{s}}^k, n^k)\) returns the indices of the elements with the top-\(n^k\) scores in \({\mathbf{s}}^k\).

Then in each layer, we construct the adjacency matrix and the input feature matrix of the next layer from the rows and columns indexed by \(idx^k\):

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{A}}^{k+1} & =&\displaystyle {\mathbf{A}}^k(idx^k, idx^k), \\ \hat{\mathbf{s}}^k & =&\displaystyle \text{Sigmoid}({\mathbf{s}}^k(idx^k)), \\ \hat{\mathbf{X}}^k & =&\displaystyle {\mathbf{X}}^k(idx^k,:), \\ {\mathbf{X}}^{k+1} & =&\displaystyle \hat{\mathbf{X}}^k \odot (\hat{\mathbf{s}}^k{\mathbf{1}}_d^{\top}), \end{array} \end{aligned} $$
(6.47)

where \({\mathbf{1}}_d\) is a d-dimensional all-one vector and ⊙ denotes element-wise matrix multiplication. Here the normalized scores \(\hat{\mathbf{s}}^k\) act as weighted masks that further gate the feature matrix \(\hat{\mathbf{X}}^k\).

The gUnpool operator performs the inverse of the gPool operation, restoring the graph to its original structure. Specifically, gUnpool records the indices of the selected nodes at the corresponding pooling level and then simply places the nodes back in their original positions:

$$\displaystyle \begin{aligned} {\mathbf{X}}^{k-1} = \text{distribute}({\mathbf{0}}_{n^{k-1}\times d}, {\mathbf{X}}^k, idx^k), \end{aligned} $$
(6.48)

where \(idx^k\) is the indices of the \(n^k\) selected nodes at the corresponding gPool level, and the distribute function places the row vectors of \({\mathbf{X}}^k\) into an \(n^{k-1}\times d\) all-zero feature matrix according to the indices \(idx^k\).

Compared to DiffPool, gPool can reduce the storage complexity by replacing the cluster assignment matrix with a projection vector at each layer.
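A compact Python sketch of the gPool operator in Eqs. (6.46) and (6.47) is given below; the projection vector p and the number of kept nodes n_keep are assumed inputs, and learning them end to end is omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gpool(A, X, p, n_keep):
    """gPool (Eqs. 6.46-6.47): score nodes, keep the top-n_keep, and gate their features."""
    s = X @ p / np.linalg.norm(p)               # projection scores, shape (N,)
    idx = np.argsort(-s)[:n_keep]               # indices of the highest-scoring nodes
    A_next = A[np.ix_(idx, idx)]                # adjacency of the induced subgraph
    X_next = X[idx] * sigmoid(s[idx])[:, None]  # gate selected features by normalized scores
    return A_next, X_next, idx                  # idx is recorded for the paired gUnpool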

In this section, we introduce how to obtain the global graph representation from local node representations with graph pooling operations, which are widely used in a series of graph-level tasks such as graph classification and interaction prediction. When applying GNNs to model text, graph pooling can effectively help us extract sentence-level [72, 152] or document-level [23, 77] information for downstream tasks.

6.6 Self-Supervised Graph Representation Learning

Self-supervised learning methods, which have recently achieved immense success in the CV and NLP areas, can learn effective representations with well-designed pre-training tasks instead of expensive downstream task labels. Specifically, self-supervised graph representation learning [60] first learns node and graph representations via graph-based pre-training tasks that require no human supervision, such as graph structure reconstruction or pseudo-label prediction. Then, the learned models or representations can be used in downstream tasks (e.g., graph/node classification). As shown in Figs. 6.11 and 6.12, we will introduce three types of self-supervised graph representation learning methods, namely, generative, predictive, and contrastive methods, distinguished by their pre-training tasks.

Fig. 6.11 Illustration of generative (left) and predictive (right) methods: the encoder maps the input graph to representations, and the decoder reconstructs the input graph (generative) or predicts auxiliary information (predictive)

Fig. 6.12 Illustration of contrastive methods: two views of the input graph are encoded into representations and compared by a contrastive decoder

Generative Methods

The generative methods aim to reconstruct and predict some components (e.g., graphs, edges, node features) of input data.

Structure Reconstruction

These works aim to recover the adjacency matrix or masked edges of a graph. Graph autoencoder (GAE) [51] learns node representations by a two-layer graph convolutional network and then reconstructs the adjacency matrix of the input graph, and variational graph autoencoder (VGAE) [51] is a latent variable variant of GAE. ARGA and ARVGA [75] are adversarial variants of GAE and VGAE, respectively, combining autoencoder and adversarial approaches. AGE [21] adaptively defines the reconstruction objective of adjacency matrices in an iterative manner.
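For GAE-style structure reconstruction, the decoder is simply an inner product between node representations followed by a sigmoid; the sketch below shows this decoder together with a binary cross-entropy reconstruction loss, which is a simplified stand-in for the full (variational or adversarial) objectives of the cited models.

import numpy as np

def gae_reconstruct(Z):
    """Inner-product decoder: predicted edge probabilities sigmoid(Z Z^T)."""
    logits = Z @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))

def reconstruction_loss(A, Z, eps=1e-9):
    """Binary cross-entropy between the observed adjacency A and its reconstruction."""
    P = gae_reconstruct(Z)
    return -np.mean(A * np.log(P + eps) + (1 - A) * np.log(1 - P + eps))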

Feature Reconstruction

These works aim to recover the node attributes of a graph, i.e., the input features of GNNs. MGAE [108] takes both the corrupted node content and the network structure as input and predicts the original node features. GALA [76] proposes a symmetric graph convolutional autoencoder based on Laplacian sharpening and smoothing to learn node representations by predicting input features. Here the Laplacian sharpening encourages the reconstructed feature of a node to be far away from those of its neighbors. GPT-GNN [44] uses both graph and feature generation pre-training tasks to model the structural and semantic information of the graph.

Predictive Methods

Predictive methods learn informative representations with self-supervised signals from some auxiliary information, such as pseudo labels and graph properties. Depending on the prediction targets, the predictive methods can be further divided into three types.

Property Prediction

These methods manually define some high-level information of the graph for prediction. GROVER [87] uses a graph Transformer as the encoder and employs both contextual property prediction and graph-level motif prediction to encode domain knowledge. S2GRL [79] takes the k-hop contextual prediction as a pre-training task and trains a well-designed GNN to learn representations. SELAR [46] predicts the meta-paths, which are composite relations of multiple edge types in heterogeneous graphs.

Pseudo-Label Prediction

This line of work employs an iterative framework to learn representations and update cluster labels. M3S [98] employs the DeepCluster [13] algorithm to generate pseudo cluster labels and designs an aligning mechanism and a multistage training framework to predict refined pseudo labels. IFC-GCN [43] proposes an EM-like framework that alternately rectifies the pseudo labels from feature clustering and updates node features by predicting the pseudo labels.

Invariant Information Preserving

These works aim to preserve some intrinsic and invariant information in the graph. Compared with the property prediction-based methods, there is usually no explicit meaning or closed-form formula for the invariant information. CCA-SSG [145] employs the canonical correlation analysis (CCA) method to maximize the correlation between two augmented views of the same input and thus preserves augmentation-invariant information. Lagraph [118] assumes that behind each observed graph there exists a noise-free latent graph from which the observed graph is randomly generated. Lagraph implicitly predicts the latent graph as a pre-training task to learn informative graph and node representations.

Contrastive Methods

Contrastive methods first generate two contrastive views from graphs/nodes and then maximize the mutual information between them. In this way, the learned representations become more robust to perturbations. Typically, contrastive methods regard the views from the same graph/node as a positive pair and the views randomly selected from different graphs/nodes as negative pairs. The representation similarity of a positive pair is forced to be larger than that of negative pairs. Categorized by the views to contrast, contrastive methods can be divided into two groups.
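The contrastive objective can be made concrete with an InfoNCE-style loss over two views: matched rows of the two representation matrices form positive pairs, and all other rows act as negatives. The following Python sketch is a generic formulation under an assumed temperature hyperparameter, not the exact loss of any single cited method.

import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE-style contrastive loss: row i of z1 should match row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize both views
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (N, N) similarity matrix
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # positives lie on the diagonal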

Substructure-Based Methods

This line of work usually contrasts views between structures of different scales. InfoGraph [95] takes different substructures of the original graph (e.g., nodes, edges, triangles) as contrastive views to generate graph-level representations. DGI [107] and GMI [80] build contrastive views between a graph and its nodes. MVGRL [39] generates views by sampling subgraphs and learns both node- and graph-level representations. GCC [82] treats different subgraphs as contrastive views and introduces the InfoNCE loss [73] into large-scale graph pre-training.

Augmentation-Based Methods

These works usually generate views by applying different perturbations to input graphs and features. GRACE [158] and GraphCL [142] randomly perturb the input graph (e.g., by node and edge dropping) to generate contrastive views. GCA [159] adaptively builds contrastive views based on different graph properties (e.g., node degree, eigenvector centrality, PageRank). Here nodes or edges with small importance scores (e.g., low node degrees) are more likely to be dropped in contrastive views. JOAO [141] and AD-GCL [100] automatically select graph augmentations and generate contrastive views in adversarial ways. Instead of contrasting augmented data, DSGC [132] constructs contrastive views from the dual spaces of hyperbolic and Euclidean geometry. GASSL [134] generates views by directly perturbing input features and hidden layers. SimGRACE [117] adds Gaussian noise to model parameters as perturbations to generate contrastive views. MA-GCL [34] proposes a novel augmentation strategy that randomly perturbs the neural architecture of GNN encoders (i.e., random permutations of graph filters, linear projections, and nonlinear functions) as contrastive views. HeCo [112] employs the network schema and meta-path views of a heterogeneous graph as two specific views.

Adaptation Approaches

After the self-supervised training process, there are roughly three paradigms to adapt the learned models or representations for downstream tasks.

Pre-training-Fine-Tuning

These methods [44, 87, 142] first train the parameters of the graph encoder on unlabeled datasets in a self-supervised way, and the pre-trained parameters are then used as the initialization for the subsequent fine-tuning step, which updates the encoder in a supervised way on downstream tasks.

Unsupervised Representation Learning

These methods [51, 75, 79, 158] train the graph encoder with pre-training tasks in the first stage, and the pre-trained encoder is taken as a feature extractor with frozen parameters to generate representations for downstream tasks in the second stage.
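This protocol can be summarized in a few lines: the frozen encoder produces representations once, and only a lightweight classifier is trained on them. The sketch below assumes precomputed node representations Z, labels y, and index splits, and uses scikit-learn's logistic regression purely as an example of such a lightweight classifier.

from sklearn.linear_model import LogisticRegression

def evaluate_frozen_representations(Z, y, train_idx, test_idx):
    """Fit a simple classifier on frozen self-supervised representations and report accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(Z[train_idx], y[train_idx])         # only the classifier is trained
    return clf.score(Z[test_idx], y[test_idx])  # the graph encoder itself stays frozen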

Multitask Training

These methods [46, 82, 98] train graph encoders on both pre-training and downstream tasks with well-designed loss functions, which can be seen as a type of multitask learning, where the pre-training tasks are auxiliary tasks for downstream ones.

In summary, self-supervised graph representation learning methods can learn effective graph and node representations based on various pre-training tasks without labels. Different from pre-trained language models, where pre-training-fine-tuning is the mainstream paradigm for adapting to downstream tasks, the unsupervised representation learning paradigm is widely used for graph data, where only the node/graph representations from the last feed-forward layer are fed into downstream tasks as features. An intuitive reason is that popular tasks on graph data (e.g., node/graph classification) require less model capacity than those in NLP (e.g., machine translation). Therefore, the final representations of graph encoders are usually sufficient, and it is not necessary to fine-tune the learned graph encoders.

6.7 Applications

In this section, we will introduce several typical applications of graph representation learning in the NLP area.

Text Classification

Text classification is an essential task in NLP. Typical GNN models, including GCNs [2, 23, 37, 41, 52, 69] and GATs [106], are applied to text classification to model the structural information (e.g., citation relationships) between documents. However, these works do not fully model the structural information within texts. Thus, some works manage to construct graphs from texts. Peng et al. [77] propose to transform texts into graphs of words and then apply graph convolutional operations to them. Yao et al. [135] propose to construct a text graph with document and word nodes and utilize GCNs to learn representations for both word nodes and document nodes. Besides constructing the text graph with human heuristics as in the above works, there are also some intrinsic graph structures that can be used, such as dependency parsing trees and semantic parsing graphs. One of the most typical works utilizing these intrinsic graph structures is the Tree-LSTM [101] introduced in this chapter.

Sequence Labeling

Sequence labeling is another classical task in NLP, which aims to assign a label to each word in the text sequence. Sentence LSTM [151] introduces graph recurrent networks for sentence modeling, where each word node connects with its neighboring words in a context window and a global sentence node links with all word nodes. Then, the hidden representations of word nodes can be used for predicting word labels. Sentence LSTM achieves promising performance on the POS tagging and NER tasks. To solve the semantic role labeling task, Marcheggiani et al. [65] propose a variant of GCN [52] that performs reasoning on syntactic dependency trees (i.e., graphs with labeled edges) and employs an edge-wise gating mechanism to consider the information of each dependency edge. They further show that GCNs and LSTMs are functionally complementary in this task.

Knowledge Acquisition

As a crucial subtask of knowledge acquisition, relation extraction aims to predict the relationship between entities in plain text. While CNNs and RNNs have achieved promising results on multiple benchmarks, researchers find that the syntactic and semantic information of the text, such as adjacency, syntactic dependencies, and discourse relations, is also helpful for relation extraction. To this end, Zhang et al. [152] propose a GNN-based method for relation extraction, which can efficiently aggregate information from arbitrary dependency graphs. It also applies a novel pruning strategy to the input tree that keeps only the words informative for relation extraction. To model the rich relations across entities, Zhu et al. [157] propose to construct entity graphs and generate edge parameters to propagate diverse relational information, which greatly extends the edge types and enables GNNs to conduct complex reasoning. Considering cross-sentence dependencies like coreference and discourse relations, Peng et al. [78] further explore extracting N-ary relations among multiple entities from multiple sentences by applying graph LSTMs to document graphs.

Event extraction is another knowledge acquisition task, which aims to identify event triggers and the corresponding arguments of each trigger in texts. Nguyen et al. [72] propose a syntactic GCN, which models the dependency tree and learns representations of word nodes to extract events from texts. In addition, Liu et al. [59] find that modeling syntactic relations helps capture long-range dependencies better, and these shortcut arcs can help extract multiple events jointly. Meanwhile, to deal with ambiguous and unseen triggers, Zhang et al. [149] summarize event structure knowledge from training data and construct event background graphs for each event. These graphs help identify correct events by matching against the structure knowledge.

Fact Verification

Fact verification aims to retrieve evidence from plain text and verify given claims with the evidence. In other words, we need to label a given claim as SUPPORTED, REFUTED, or NOT ENOUGH INFO, indicating that the evidence supports the claim, refutes it, or is insufficient to validate it. By regarding the problem as a natural language inference (NLI) [1] task, traditional methods simply combine the evidence via concatenation or build models on evidence-claim pairs. To integrate multiple pieces of evidence for fact reasoning, Zhou et al. [156] propose a graph-based evidence aggregating and reasoning framework, which propagates and aggregates information on a fully connected evidence graph. In this way, different pieces of evidence can interact with each other sufficiently, which helps in situations where multiple pieces of evidence are necessary for making the decision. Liu et al. [61] further incorporate a kernel-based attention mechanism into GAT for fine-grained evidence aggregation, and the proposed KGAT achieves improved performance.

Machine Translation

Traditionally, machine translation (MT) is modeled as a sequence-to-sequence (seq2seq) problem. Graph structures, however, enable MT models to incorporate explicit linguistic biases. To this end, two typical kinds of graphs can be utilized in MT. Some works [4, 5, 36] model syntactic trees with GCNs to learn syntax-aware sentence representations and show improvements over vanilla seq2seq methods. To involve more semantic information, other works further consider semantic role labeling (SRL) and abstract meaning representation (AMR) graphs [64, 94]. Besides traditional MT, GNNs are also utilized in multimodal and document-level MT. For multimodal MT, Yin et al. [137] build combined graphs with entities from both images and sentences and then apply GNNs for message passing across modalities. For document-level MT, Xu et al. [121] model long-term dependencies (e.g., coreference) with GNNs on document graphs and achieve superior performance in translation coherence.

Question Answering

Question answering (QA) aims to generate or find answers for a question based on relevant documents or knowledge bases, which requires models to reason and infer the right answers given the question (see Chap. 4 for an introduction). Since GNNs are designed to model relational data, they are also suitable for QA. To apply GNNs to QA, we need to first build graphs containing question-related entities and their relations. Typically, according to the provided context, researchers could utilize existing knowledge graphs [96, 97, 136] or extract entities and links from documents to construct graphs [11, 24, 29, 30, 58]. Afterward, multiple kinds of GNNs such as GCN, GAT, and R-GCN [90] can be directly used to reason over the graphs [24, 58]. To jointly capture relational and semantic messages, some works explore how to combine powerful PTMs and GNNs for complex reasoning. Ding et al. [24] construct entity graphs utilizing BERT, which predicts node entities and relational edges iteratively. Yasunaga et al. [136] further incorporate PTM-encoded context representations into knowledge graphs to better perform message passing.

Besides the applications in the NLP area, GNNs are widely used in various application scenarios, such as community detection [104, 133, 148], information diffusion prediction [130], recommender systems [8, 28, 115, 129, 139], molecular fingerprints [50], chemical reaction prediction [25], protein interface prediction [31], biomedical engineering [86, 161], etc. Since our book mainly focuses on NLP, readers who are interested in these applications can refer to these papers for more details.

6.8 Summary and Further Readings

In this chapter, we have introduced graph representation learning, which projects graph structure information into a continuous vector space and makes deep learning techniques applicable to graph data. We first talk about shallow node representations, covering spectral clustering, shallow neural networks, and matrix factorization. Then we introduce deep node representations, from autoencoder-based methods to graph neural networks. Afterward, we present how to obtain the global graph representation from node representations. Finally, we introduce how graph representation learning helps a series of NLP tasks.

For further reading on graph representation learning, there are several recommended surveys and books. In terms of general graph embedding, Goyal and Ferrara [35] and Cui et al. [22] conduct surveys on relevant models and their applications. We also write a monograph [126] about our systematic work on the topic of network embedding. In terms of GNNs, Wu et al. [114] write a book covering more than 20 topics of GNNs in 4 parts: introduction, foundations, frontiers, and applications. Shi et al. [93] write a monograph providing a launch point for discussing the latest trends in GNNs. Wu et al. [113] present a survey of NLP applications based on GNNs. Shi et al. [92] focus on the representation learning of heterogeneous graphs. We also provide a more comprehensive survey of GNNs in our review [155], covering a broader range of aspects, such as applications to images and chemistry.