6.1 Introduction

Graphs are a natural way to represent objects and their relationships. As a typical non-Euclidean data structure, a graph provides a flexible way to model the interactions between individual units in our daily lives. For example, social media networks, citation graphs, biological networks, and recommendation systems can all be modeled as graph structures.

Graph structure plays an equally important role in the NLP area as the sequence form introduced in the previous chapters. As shown in Fig. 6.1, multiple granularities of text can be organized in graph forms:

  1. Word Level. There are a variety of syntactic/semantic relations between the words within a sentence or a document. For example, we can obtain the dependency relations between words by dependency parsing or the coreference relations between words by coreference resolution. These syntactic/semantic relationships can effectively help us understand the compositional semantics of different constituents in the sentence and document. We can naturally regard words as nodes and their syntactic/semantic relations as edges, representing a sentence or a document as a relational graph.

    Fig. 6.1 An illustrative example of text-based graphs in different levels. The documents used in the figure are obtained from Wikipedia’s official website

  2. Document Level. In the real world, documents usually connect with each other. For example, in Wikipedia articles, the entity mentions within an article may contain human-annotated hyperlinks to the articles describing these entities, and a scientific paper may refer to other scientific papers as its related work. The connected documents may provide important background knowledge to comprehend the meaning of the document. We can represent the interactions between documents as a document graph by regarding documents as nodes and hyperlinks as edges.

  3. Knowledge Level. Human knowledge, such as world, linguistic, commonsense, and domain knowledge, is essential for understanding languages. In practice, most of this knowledge can be organized in a graph form. For example, the world knowledge graph in Chap. 9 regards entities as nodes and their relationships as edges. Representing human knowledge as graphs enables further complex reasoning over documents and knowledge.

It is critical to model the structural information in graph data, which can help models better understand, categorize, and reason over text in NLP tasks. Graph representation learning aims to learn low-dimensional representations of nodes or the entire graph. The geometric relationships in the low-dimensional semantic space should effectively reflect the structural information of the original graph, such as the global topological structure or the local graph connectivity.

In this chapter, we discuss how to properly represent graph data and their characteristics, starting from symbolic representations (Sect. 6.2). Then, we move to distributed representations, including the local representations of nodes (Sects. 6.3 and 6.4) as well as the global representation of the whole graph (Sect. 6.5). We also introduce recent advances in self-supervised learning on graphs (Sect. 6.6). Finally, we present how text is processed in graph form in downstream NLP tasks (Sect. 6.7), mainly at the word, sentence, and document levels. We leave a detailed introduction of knowledge-level applications to Chap. 9.

6.2 Symbolic Graph Representation

A common practice is to denote the graph with its node set \(\mathcal {V}\) and the edge set \(\mathcal {E}\). We can naturally represent a graph as \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\), where \(e=(v_i,v_j)\in \mathcal {E}\) is a directed edge from node vi to vj or an undirected one between vi and vj. When processing graph data in a computer, we usually represent the connections in a graph as an adjacency matrix.

For an undirected graph, as shown in Fig. 6.2, we construct an adjacency matrix \(\mathbf {A}\in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\). If there is any edge between node v and node u, i.e., \((v,u)\in \mathcal {E}\), we have the corresponding element Avu = Auv = 1; otherwise Avu = Auv = 0.

Fig. 6.2 The adjacency matrix representation for an undirected graph

For a directed graph, as shown in Fig. 6.3, Avu = 1 indicates there is an edge from node v to node u.

Fig. 6.3 The adjacency matrix representation for a directed graph

Moreover, for a weighted graph, we can store the weights of the edges instead of binary values in the adjacency matrix A, as shown in Fig. 6.4.

Fig. 6.4 The adjacency matrix representation for a weighted graph

In the era of statistical learning, such symbolic representations are widely used in graph-based NLP such as TextRank [67], where word and sentence graphs can be respectively built for keyword extraction [6] and extractive document summarization [66]. Although convenient and straightforward, the adjacency matrix representation suffers from a scalability problem. Firstly, the adjacency matrix A takes \(|\mathcal {V}|\times |\mathcal {V}|\) storage space, which is usually unacceptable when \(|\mathcal {V}|\) grows large. Secondly, the adjacency matrix is usually sparse, which means most of its entries are zeros. The data sparsity makes discrete algorithms applicable, but it is still hard to develop efficient algorithms for statistical learning [81].
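To make the symbolic representation concrete, the following Python sketch (using NumPy and SciPy, which are assumptions of this illustration rather than tools used in the referenced works) builds the adjacency matrix of a toy undirected graph and also stores it in a sparse format to mitigate the storage cost discussed above; the node names and edge list are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A toy undirected graph over nodes A-E (hypothetical example).
nodes = ["A", "B", "C", "D", "E"]
idx = {name: i for i, name in enumerate(nodes)}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

# Dense adjacency matrix: |V| x |V| entries, symmetric for an undirected graph.
A = np.zeros((len(nodes), len(nodes)), dtype=np.int8)
for u, v in edges:
    A[idx[u], idx[v]] = 1
    A[idx[v], idx[u]] = 1

# Sparse storage keeps only the nonzero entries, mitigating the O(|V|^2) cost.
A_sparse = csr_matrix(A)
print(A)
print(A_sparse.nnz, "nonzero entries out of", A.size)
```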

6.3 Shallow Node Representation Learning

To address the above issues, shallow node representation learning methods propose to map nodes to low-dimensional vectors. Formally, the goal is to learn a d-dimensional vector representation \(\mathbf {v}\in \mathbb {R}^d\) for every node \(v\in \mathcal {V}\) in the graph. Learned node representations are supposed to capture the graph’s structure information and then can be used as input features for downstream graph-related tasks. In this section, we will introduce several kinds of shallow node representation learning methods, including spectral clustering, shallow neural networks, and matrix factorization.

6.3.1 Spectral Clustering

Early methods of shallow node representation are usually based on spectral clustering, which typically computes the first k eigenvectors (or singular vectors) of an affinity matrix, such as the adjacency or Laplacian matrix of the graph. Now we present several algorithms based on spectral clustering, including locally linear embedding, Laplacian Eigenmap, directed graph embedding, and latent social dimensions.

Locally Linear Embedding (LLE)

LLE [88] assumes that the representations of a node and its neighbors lie in a locally linear patch of the manifold. In other words, a node’s representation can be approximated by a linear combination of the representations of its neighbors. LLE uses the linear combination of neighbors to reconstruct the center node. Formally, the reconstruction error of all nodes can be expressed as

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{W},\mathbf{V})=\sum_{v\in \mathcal{V}}\Vert \mathbf{v}-\sum_{u\in \mathcal{V}} {\mathbf{W}}_{vu}\mathbf{u} \Vert^2, {} \end{aligned} $$
(6.1)

where \(\mathbf {V} \in \mathbb {R}^{|\mathcal {V}|\times d}\) is the representation matrix containing all node representations v and Wvu is the learnable contribution coefficient of node u to v. LLE enforces Wvu = 0 if v and u are not connected, i.e., \((v,u)\not \in \mathcal {E}\). Further, the summation of a row of matrix W is set to 1, i.e., \(\sum _{u\in \mathcal {V}} {\mathbf {W}}_{vu}=1\).

Equation (6.1) is solved by alternately optimizing the weight matrix W and the representation matrix V. The optimization over W can be solved as a least-squares problem. The optimization over V leads to the following optimization problem:

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{W},\mathbf{V})=\sum_{v\in \mathcal{V}}\Vert\mathbf{v}-\sum_{u\in \mathcal{V}} {\mathbf{W}}_{vu}\mathbf{u}\Vert^2, \end{aligned} $$
(6.2)
$$\displaystyle \begin{aligned} s.t. \ \sum_{v\in \mathcal{V}} \mathbf{v}=\mathbf{0},\quad \quad |\mathcal{V}|{}^{-1} \sum_{v\in \mathcal{V}} {\mathbf{v}}^{\top}\mathbf{v}={\mathbf{I}}_d, \end{aligned} $$
(6.3)

where Id denotes the d × d identity matrix. The conditions in Eq. (6.3) ensure the uniqueness of the solution. The first condition centers all node representations at the origin, and the second condition guarantees that different coordinates have the same scale, i.e., equal contribution to the reconstruction error.

The optimization problem can be formulated as the computation of eigenvectors of matrix \(({\mathbf {I}}_{|\mathcal {V}|}-{\mathbf {W}}^{\top })({\mathbf {I}}_{|\mathcal {V}|}-\mathbf {W})\), which is an easily solvable eigenvalue problem. More details can be found in the note [20].
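Assuming the reconstruction weights W have already been fitted by least squares, a minimal NumPy sketch of the remaining LLE step computes node representations from the bottom eigenvectors of \(({\mathbf {I}}-\mathbf {W})^{\top }({\mathbf {I}}-\mathbf {W})\); the function name and interface are illustrative.

```python
import numpy as np

def lle_embedding(W: np.ndarray, d: int) -> np.ndarray:
    """Given row-stochastic reconstruction weights W (|V| x |V|), return
    d-dimensional node representations. W is assumed to be precomputed."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    # Skip the smallest (near-zero) eigenvalue, whose eigenvector is constant.
    return eigvecs[:, 1:d + 1]
```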

Laplacian Eigenmap

Laplacian Eigenmap [7] simply follows the idea that the representations of two connected nodes should be close. Specifically, the “closeness” is measured by the square of the Euclidean distance. We use \(\mathbf {D}\in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\) to denote the diagonal degree matrix, where Dvv is the degree of node v. By defining the Laplacian matrix L as the difference between D and the adjacency matrix A, we have L = D −A. The Laplacian Eigenmap algorithm minimizes the following objective:

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{V})=\sum_{\{v,u|(v,u)\in \mathcal{E}\}}\Vert\mathbf{v}-\mathbf{u}\Vert^2, {} \end{aligned} $$
(6.4)
$$\displaystyle \begin{aligned} s.t.\ {\mathbf{V}}^{\top}\mathbf{D}\mathbf{V}={\mathbf{I}}_d. {} \end{aligned} $$
(6.5)

The cost function is the summation of the square loss of all connected node pairs, and the condition prevents the trivial all-zero solution caused by arbitrary scale. Equation (6.4) can be reformulated in matrix form as

$$\displaystyle \begin{aligned} {\mathbf{V}}^*=\underset{{\mathbf{V}}^{\top}\mathbf{DV}={\mathbf{I}}_d}{\arg\min} \text{tr}({\mathbf{V}}^{\top}\mathbf{L}\mathbf{V}), {} \end{aligned} $$
(6.6)

where tr(⋅) is the matrix trace function. The optimal solution V∗ of Eq. (6.6) consists of the eigenvectors corresponding to the d smallest nonzero eigenvalues of L. Note that the Laplacian Eigenmap algorithm can be easily generalized to weighted graphs.
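A minimal NumPy/SciPy sketch of Laplacian Eigenmap under the above formulation is given below; it assumes a symmetric adjacency matrix without isolated nodes and solves the generalized eigenproblem Lv = λDv, keeping the eigenvectors of the d smallest nonzero eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(A: np.ndarray, d: int) -> np.ndarray:
    """A: symmetric (possibly weighted) adjacency matrix without isolated nodes.
    Returns |V| x d node representations satisfying V^T D V = I (Eq. 6.5)."""
    D = np.diag(A.sum(axis=1))
    L = D - A                      # graph Laplacian L = D - A
    # Generalized eigenproblem L v = lambda D v; eigenvalues in ascending order,
    # eigenvectors returned D-orthonormal by SciPy.
    eigvals, eigvecs = eigh(L, D)
    # Discard the trivial eigenvector associated with eigenvalue 0.
    return eigvecs[:, 1:d + 1]
```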

A significant limitation of both LLE and Laplacian Eigenmap is that their cost functions are symmetric, which prevents them from being applied to directed graphs.

Directed Graph Embedding (DGE)

DGE [17] generalizes Laplacian Eigenmap for both directed and undirected graphs based on a predefined transition matrix. For example, we can define a transition probability matrix \(\mathbf {P}\in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\) where Pvu denotes the probability that node v walks to u. The transition matrix defines a Markov random walk through the graph. We denote the stationary value of node v as πv where ∑vπv = 1. The stationary distribution of random walks is commonly used in many ranking algorithms such as PageRank [74]. DGE designs a new cost function that emphasizes those important nodes with higher stationary values:

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{V})=\sum_{v\in \mathcal{V}}\pi_v \sum_{u\in \mathcal{V}}{\mathbf{P}}_{vu}\Vert \mathbf{v}-\mathbf{u}\Vert^2. {} \end{aligned} $$
(6.7)

By denoting \(\mathbf {M} = \text{diag}(\pi _1,\pi _2,\dots ,\pi _{|\mathcal {V}|})\), the cost function Eq. (6.7) can be reformulated as

$$\displaystyle \begin{aligned} \mathcal{L}(\mathbf{V})=2\text{tr}({\mathbf{V}}^{\top}\mathbf{B}\mathbf{V}), {} \end{aligned} $$
(6.8)
$$\displaystyle \begin{aligned} s.t. \ {\mathbf{V}}^{\top}\mathbf{M} \mathbf{V}={\mathbf{I}}_d, {} \end{aligned} $$
(6.9)

where

$$\displaystyle \begin{aligned} \mathbf{B}=\mathbf{M}-\frac{\mathbf{M} \mathbf{P}-{\mathbf{P}}^{\top}\mathbf{M}}{2}. {} \end{aligned} $$
(6.10)

The condition Eq. (6.9) is added to remove an arbitrary scaling factor of solutions. Similar to Laplacian Eigenmap, the optimization problem can also be solved as a generalized eigenvector problem.

Latent Social Dimensions

Latent social dimensions [103] introduce modularity [71] into the cost function, instead of minimizing the distance between node representations as in previous works. Modularity is a measurement that characterizes how far the graph is from a uniform random graph. Given \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), we assume that nodes \(\mathcal {V}\) are divided into n nonoverlapping communities. By “uniform random graph,” we mean that nodes connect to each other randomly given their degrees. Under this assumption, the expected number of edges between v and u is \(\operatorname {deg}(v)\frac {\operatorname {deg}(u)}{2|\mathcal {E}|}\). Then, the modularity Q of a graph is defined as

$$\displaystyle \begin{aligned} Q=\frac{1}{2|\mathcal{E}|}\sum_{v,u}\Big({\mathbf{A}}_{vu}-\frac{\operatorname{deg}(v)\operatorname{deg}(u)}{2|\mathcal{E}|}\Big)\delta(v,u), \end{aligned} $$
(6.11)

where δ(v, u) = 1 if v and u belong to the same community and δ(v, u) = 0 otherwise. Larger modularity indicates that the subgraphs inside communities are denser, which follows the intuition that a community is a dense well-connected cluster. Then, the problem is to find a partition that maximizes the modularity Q.

However, hard clustering based on modularity maximization is proven to be NP-hard. Therefore, the problem is relaxed to a soft case. Let \(\mathbf {d}\in \mathbb {Z}_+^{|\mathcal {V}|}\) denote the degrees of all nodes and \(\mathbf {S}\in \{0,1\}^{|\mathcal {V}|\times n}\) denote the community indicator matrix, where Svc = 1 indicates node v belongs to community c and Svc = 0 otherwise. Then, we define the modularity matrix B as

$$\displaystyle \begin{aligned} \mathbf{B}=\mathbf{A}-\frac{\mathbf{d}{\mathbf{d}}^{\top}}{2|\mathcal{E}|}, \end{aligned} $$
(6.12)

and modularity Q can be reformulated as

$$\displaystyle \begin{aligned} Q=\frac{1}{2|\mathcal{E}|}\text{tr}({\mathbf{S}}^{\top}\mathbf{B}\mathbf{S}). {} \end{aligned} $$
(6.13)

By relaxing S to a continuous matrix, it has been proved that the optimal solution of S is the top-n eigenvectors of modularity matrix B [70]. Then, the eigenvectors can be used as node representations.
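The following sketch (hypothetical interface, NumPy assumed) constructs the modularity matrix B and takes its top eigenvectors as soft social dimensions, mirroring the relaxed solution described above.

```python
import numpy as np

def social_dimensions(A: np.ndarray, n_dims: int) -> np.ndarray:
    """Top eigenvectors of the modularity matrix B = A - d d^T / (2|E|),
    used as node representations (soft community indicators)."""
    deg = A.sum(axis=1)
    two_m = deg.sum()                      # 2|E| for an undirected graph
    B = A - np.outer(deg, deg) / two_m
    eigvals, eigvecs = np.linalg.eigh(B)   # eigenvalues in ascending order
    return eigvecs[:, -n_dims:]            # eigenvectors of the largest eigenvalues
```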

To conclude, spectral clustering-based methods often define a cost function that is linear or quadratic in the node representations. The problem can then be reformulated in matrix form and solved by computing the eigenvectors of the matrix. However, the computation of eigenvectors for large-scale matrices is both time- and space-consuming, which limits these methods from being applied in real-world scenarios.

6.3.2 Shallow Neural Networks

With the success of word2vec [68], many works resort to shallow neural networks for node representation learning. Typically, each node is assigned a vector of trainable parameters as its representation, and the parameters are trained by optimizing a certain objective via gradient descent.

DeepWalk

DeepWalk [81] proposes a novel approach that introduces neural network techniques into graph representation learning for the first time. Compared with the aforementioned methods based on eigenvector computation, DeepWalk provides a faster way to learn low-dimensional node representations. The basic idea of DeepWalk is to adapt the well-known word representation learning algorithm word2vec [68] by regarding nodes as words and random walks as sentences.

Formally, given a graph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), DeepWalk uses node v to predict its neighboring nodes \(u_{-w},\dots ,u_{-1},u_1,\dots ,u_w\) in short random walk sequences, where w is the window size of skip-gram [68]. This can be formulated as

$$\displaystyle \begin{aligned} \min \sum_{j=-w,j\neq 0}^w-\log P(u_{j}|v), \end{aligned} $$
(6.14)

where the prediction probabilities of different nodes are assumed to be independent, and the overall loss sums the losses of all nodes in every random walk.

DeepWalk assigns each node v with two representations: node representation \(\mathbf {v}\in \mathbb {R}^d\) and context representation \({\mathbf {v}}^c\in \mathbb {R}^d\). Then, each probability P(u|v) is formalized as a Softmax function over all nodes:

$$\displaystyle \begin{aligned} P(u|v)=\frac{\exp({\mathbf{v}}^{\top}{\mathbf{u}}^c)}{\sum_{u'\in \mathcal{V}}\exp({\mathbf{v}}^{\top}\mathbf{u'}^c)}. {} \end{aligned} $$
(6.15)
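A rough sketch of the DeepWalk pipeline, random-walk generation followed by skip-gram training, is shown below. It assumes the third-party libraries networkx and gensim are available; the hyperparameters and the karate-club toy graph are illustrative choices, not the paper’s settings.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph: nx.Graph, num_walks: int = 10, walk_length: int = 40):
    """Generate truncated random walks, treating each walk as a 'sentence'."""
    walks = []
    nodes = list(graph.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk, cur = [str(start)], start
            for _ in range(walk_length - 1):
                neighbors = list(graph.neighbors(cur))
                if not neighbors:
                    break
                cur = random.choice(neighbors)
                walk.append(str(cur))
            walks.append(walk)
    return walks

# Skip-gram (sg=1) over the walks; window corresponds to w in Eq. (6.14).
graph = nx.karate_club_graph()
model = Word2Vec(random_walks(graph), vector_size=64, window=5, sg=1, min_count=0)
node_vec = model.wv["0"]   # learned embedding of node 0
```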

LINE

LINE [102] proposes a graph representation model which can handle large-scale graphs with arbitrary types: (un)directed or weighted. To characterize the interaction between nodes, LINE models the first-order proximity and second-order proximity of the graph.

Before introducing the details of the algorithm, we can take a step back and see the idea behind it. The modeling of first-order proximity, i.e., observed links, is the modeling of the adjacency matrix. Since the adjacency matrix is usually too sparse, the modeling of second-order proximity, i.e., nodes with shared neighbors, can serve as complementary information to enrich the adjacency matrix and make it denser.

Formally, first-order proximity between node v and u is defined as the edge weight Avu in the adjacency matrix. If nodes v and u are not connected, then the first-order proximity between them is 0.

Second-order proximity between node v and u is defined as the similarity between their neighbors. Let the row of node v in the adjacency matrix A(v, :) denote the first-order proximity between node v and other nodes. Then, the second-order proximity between v and u is defined as the similarity between A(v, :) and A(u, :). If they have no shared neighbors, the second-order proximity is 0.

To approximate first-order proximity \(\hat {P}_1(v,u)=\frac {{\mathbf {A}}_{vu}}{\sum _{(v',u')\in \mathcal {E}}{\mathbf {A}}_{v'u'}}\), the joint probability between v and u is defined as

$$\displaystyle \begin{aligned} P_1(v,u)=\frac{1}{1+\exp(-{\mathbf{v}}^{\top}\mathbf{u})}. \end{aligned} $$
(6.16)

To approximate second-order proximity \(\hat {P}_2(u|v)=\frac {{\mathbf {A}}_{vu}}{\sum _{u'}{\mathbf {A}}_{vu'}}\), the probability that node u appears in v’s context P2(u|v) is defined in the same form as Eq. (6.15). In this way, the second-order relationship between node representations is bridged by the context representations of their shared neighbors.

For parameter learning, the distances between \(\hat {P}_1(v,u)\) and P1(v, u) as well as between \(\hat {P}_2(u|v)\) and P2(u|v) are minimized. Specifically, LINE learns node representations for the first-order and second-order proximities individually and then concatenates them as the output embeddings. Although LINE can effectively capture both first-order and second-order local topological information, it cannot be easily extended to capture higher-order global topological information.
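The following toy SGD sketch optimizes LINE’s first-order objective with negative sampling; the uniform negative-sampling scheme and hyperparameters are simplifying assumptions rather than the exact procedure of the paper.

```python
import numpy as np

def line_first_order(edges, num_nodes, dim=32, neg=5, lr=0.025, epochs=10):
    """Minimal first-order LINE: pull connected nodes together, push random pairs apart."""
    rng = np.random.default_rng(0)
    V = rng.normal(scale=0.1, size=(num_nodes, dim))
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for v, u in edges:
            vv, vu = V[v].copy(), V[u].copy()
            g = 1.0 - sigmoid(vv @ vu)          # gradient of -log sigmoid(v.u)
            V[v] += lr * g * vu
            V[u] += lr * g * vv
            for un in rng.integers(0, num_nodes, size=neg):
                vn = V[un].copy()
                g = -sigmoid(V[v] @ vn)         # gradient of -log sigmoid(-v.u')
                V[v] += lr * g * vn
                V[un] += lr * g * V[v]
    return V
```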

Connection with Matrix Factorization

We prove that the DeepWalk algorithm with the skip-gram model is actually factorizing a matrix M, where each entry Mij is the logarithm of the average probability that node vi randomly walks to node vj within a fixed number of steps [127]. Intuitively, the matrix M is much denser than the adjacency or Laplacian matrices and therefore can help representations encode more structural information. In practice, the entry value Mij is estimated by random walk sampling. A more detailed proof can be found in our arXiv note [125]. Moreover, LINE is also proved to be equivalent to matrix factorization [83, 128], where the matrix is defined by the first- and second-order proximities.

To summarize, DeepWalk and LINE introduce shallow neural networks into graph representation learning. Thanks to the modeling ability of neural networks and the efficiency of shallow representations, these methods can outperform conventional graph representation learning methods such as spectral clustering-based algorithms, and are also efficient for large-scale graphs.

6.3.3 Matrix Factorization

Inspired by the connection between DeepWalk and matrix factorization, many research works explore how to learn better node representations by regarding the learning phase as a matrix factorization problem. Note that the eigenvector decomposition in spectral clustering methods can also be seen as a special case of matrix factorization. Here we present two typical matrix factorization-based graph representation learning algorithms, GraRep [12] and TADW [127], in this subsection.

GraRep

GraRep [12] directly follows the proof of the matrix factorization form of DeepWalk. According to our proof [125], DeepWalk is actually factorizing a matrix M where \(\mathbf {M}=\log \frac {\mathbf {A}+{\mathbf {A}}^2+\dots +{\mathbf {A}}^K}{K}\). From the matrix factorization form, DeepWalk considers the high-order topological information of the graph jointly. In contrast, GraRep proposes to treat the k-step information for each k separately in graph representation learning, and it can be divided into three steps:

  • Calculate the k-step transition probability matrix \({\mathbf {A}}^k\) for each k = 1, 2, …, K.

  • Obtain the k-step node representations for each k.

  • Concatenate all k-step node representations as the final node representations.

For the second step of obtaining the k-step node representations, GraRep directly uses a typical matrix decomposition technique, i.e., SVD, on \({\mathbf {A}}^k\). However, this algorithm is not very efficient, especially when k becomes large.
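A simplified NumPy sketch of the three GraRep steps is given below; it factorizes the raw k-step transition matrices by SVD and omits the log/shift transform used in the original algorithm.

```python
import numpy as np

def grarep(A: np.ndarray, K: int, d: int) -> np.ndarray:
    """Concatenate SVD-based representations of the k-step transition matrices."""
    P = A / A.sum(axis=1, keepdims=True)          # row-normalized transition matrix
    reps, Pk = [], np.eye(A.shape[0])
    for _ in range(K):
        Pk = Pk @ P                               # k-step transition probabilities
        # (GraRep additionally applies a log/shift transform; omitted in this sketch.)
        U, S, _ = np.linalg.svd(Pk)
        reps.append(U[:, :d] * np.sqrt(S[:d]))    # d-dimensional k-step representation
    return np.concatenate(reps, axis=1)           # final representation: K * d dims
```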

TADW

As the first attributed network embedding algorithm where node features are also available for learning node representations, text-associated DeepWalk (TADW) [127] further generalizes the matrix factorization framework to take advantage of text information. As shown in Fig. 6.5, the main idea of TADW is to factorize node affinity matrix \(\mathbf {M}\in \mathbb {R}^{|\mathcal {V}|\times |\mathcal {V}|}\) into the product of three matrices, \(\mathbf {W}\in \mathbb {R}^{d\times |\mathcal {V}|}\), \(\mathbf {H}\in \mathbb {R}^{d\times f_t}\), and text feature matrix \(\mathbf {T} \in \mathbb {R}^{f_t\times |\mathcal {V}|}\), where d and ft are the rank and feature dimensions, respectively. Then, TADW concatenates W and HT as 2d-dimensional node representations \(\mathbf {V}=\operatorname {concat}(\mathbf {W};\mathbf {HT})\).

Fig. 6.5 Illustration of text-associated DeepWalk (TADW)

Now the question is how to build the node affinity matrix M and how to extract the text feature matrix T from the text information. Following the proof of the matrix factorization form of DeepWalk, TADW sets the node affinity matrix M as a trade-off between efficiency and effectiveness: it factorizes the matrix \(\mathbf {M}=(\mathbf {A}+{\mathbf {A}}^2)/2\), where A is the row-normalized adjacency matrix. In this way, M can encode both first-order and second-order proximities. For the text feature matrix T, TADW first constructs the TF-IDF matrix from the text of nodes and then reduces the dimension of the TF-IDF matrix via SVD decomposition.

Formally, the model of TADW minimizes the following optimization function:

$$\displaystyle \begin{aligned} \min_{\mathbf{W},\mathbf{H}} ||\mathbf{M}-{\mathbf{W}}^{\top}\mathbf{H}\mathbf{T}||{}_F^2+\frac{\lambda}{2}(||\mathbf{W}||{}_F^2+||\mathbf{H}||{}_F^2), \end{aligned} $$
(6.17)

where λ is the regularization factor and ||⋅||F is the Frobenius norm. The parameters are optimized by updating W and H alternately via conjugate gradient descent.
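The sketch below factorizes M ≈ W⊤HT with plain gradient descent instead of the conjugate gradient method used in the paper; the hyperparameters are illustrative.

```python
import numpy as np

def tadw(M, T, d=80, lam=0.2, lr=1e-3, iters=200):
    """Gradient-based sketch of TADW: factorize M ~ W^T H T (Eq. 6.17)."""
    n, ft = M.shape[0], T.shape[0]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(d, n))
    H = rng.normal(scale=0.01, size=(d, ft))
    for _ in range(iters):
        E = M - W.T @ H @ T                        # reconstruction error
        grad_W = -2 * H @ T @ E.T + lam * W
        grad_H = -2 * W @ E @ T.T + lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    # 2d-dimensional node representations: concatenation of W and HT (per node).
    return np.concatenate([W.T, (H @ T).T], axis=1)
```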

To summarize, shallow graph representation learning methods have shown outstanding ability in modeling various kinds of graphs, such as large-scale [153] or heterogeneous [110] graphs. They are also used in many application scenarios such as recommendation systems [91, 129]. However, they still have several drawbacks [38]. Firstly, the model capacity of shallow methods is usually limited, which leads to suboptimal performance in complex scenarios. Secondly, the representations of different nodes share no parameters, which makes the number of parameters grow linearly with the number of nodes and leads to computational inefficiency. Thirdly, shallow graph representation methods are typically transductive, and thus cannot generalize to new nodes in a dynamic graph without retraining.

6.4 Deep Node Representation Learning

To address the problems of shallow node representations, researchers propose to utilize deep neural networks to aggregate information from graph structure. In this section, we introduce several typical solutions for node representation learning, including autoencoder-based methods, graph convolutional networks, graph attention networks, graph recurrent networks, graph Transformers, and several extensions based on them. Most deep node representation learning methods assume node features are available, and stack multiple neural network layers for representation learning. Here we denote the initial feature vector of node v as xv and the hidden representation of node v at the k-th layer as \({\mathbf {h}}_v^k\).

6.4.1 Autoencoder-Based Methods

Different from previous methods that use shallow neural networks to characterize the graph representations, structural deep network embedding (SDNE) [109] employs a deep autoencoder to model more complex relationships between node representations. The main body of SDNE is a deep autoencoder whose input and output vectors are the initial node feature xv and reconstructed feature \(\hat {\mathbf {x}}_v\), respectively. The algorithm takes the representations from the intermediate layer as node embeddings and encourages the embeddings of connected nodes to be similar.

Formally, a deep autoencoder first compresses the input node feature into a low-dimensional intermediate vector and then tries to reconstruct the original input from the low-dimensional intermediate vector. Hence, the intermediate hidden representation can capture the information of the input node since we can recover the input features from them. Assume the input vector is xv, then the hidden representation of each layer k is defined as

$$\displaystyle \begin{aligned} {\mathbf{h}}_v^{1}&=\operatorname{Sigmoid}({\mathbf{W}}^{1}{\mathbf{x}}_v+{\mathbf{b}}^{1}),\\ {\mathbf{h}}_v^{k}&=\operatorname{Sigmoid}({\mathbf{W}}^{k}{\mathbf{h}}_v^{k-1}+{\mathbf{b}}^{k}),\quad k=2,3,\dots ,K, \end{aligned} $$
(6.18)

where Wk and bk are the weight matrix and bias vector of the k-th layer. We assume that the hidden representation of the K-th layer has the minimum dimension, and the intermediate representation \({\mathbf {h}}_v^{K}\) can be seen as the low-dimensional representation of node v. Afterward, we can compute the output \(\hat {\mathbf {x}}_v\) by applying the reversed calculation process on \({\mathbf {h}}_v^{K}\). The optimization objective of the autoencoder is to minimize the difference between the input vector xv and the output vector \(\hat {\mathbf {x}}_v\):

$$\displaystyle \begin{aligned} \mathcal{L}_1=\sum_{v\in \mathcal{V}}\Vert\hat{\mathbf{x}}_v-{\mathbf{x}}_v\Vert^2. \end{aligned} $$
(6.19)

To encode the structure information, SDNE simply requires that the representations of connected nodes should be close to each other. Thus, the loss function is

$$\displaystyle \begin{aligned} \mathcal{L}_2=\sum_{(v,u)\in \mathcal{E}}\Vert{\mathbf{h}}_v^{K}-{\mathbf{h}}_u^{K}\Vert^2. \end{aligned} $$
(6.20)

Finally, the overall loss function is \(\mathcal {L}=\mathcal {L}_{1}+\alpha \mathcal {L}_{2}\), where α is a harmonic hyper-parameter. After the training process, \({\mathbf {h}}_v^{K}\) is taken as the representation of node v and used for downstream tasks.
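A minimal PyTorch sketch of the SDNE architecture and its combined loss is given below (PyTorch is an assumption of this illustration); the original model additionally reweights nonzero entries of the input in the reconstruction loss, which is omitted here.

```python
import torch
import torch.nn as nn

class SDNE(nn.Module):
    """A minimal SDNE-style deep autoencoder over node feature vectors."""
    def __init__(self, in_dim: int, hid_dim: int, emb_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid(),
                                     nn.Linear(hid_dim, emb_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(emb_dim, hid_dim), nn.Sigmoid(),
                                     nn.Linear(hid_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)        # h_v^K: low-dimensional node embeddings
        x_hat = self.decoder(h)    # reconstructed node features
        return h, x_hat

def sdne_loss(x, x_hat, h, src, dst, alpha=0.1):
    """L = L1 (reconstruction) + alpha * L2 (closeness of connected node pairs)."""
    l1 = ((x_hat - x) ** 2).sum()
    l2 = ((h[src] - h[dst]) ** 2).sum()   # src/dst: index tensors of edge endpoints
    return l1 + alpha * l2
```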

Experimental results show that SDNE can effectively reconstruct the input graph and achieve better results in several downstream tasks. However, the deep neural network part of SDNE, i.e., the deep autoencoder, is isolated from the graph structure during the feed-forward computation, which neglects the high-order information interaction among nodes.

6.4.2 Graph Convolutional Networks

Graph convolutional networks (GCNs) aim to generalize convolutional operation from CNNs [55] to the graph domain. The success of CNNs comes from its local connection and multilayer [54] architectures, which may benefit graph modeling as well: (1) graphs are also locally connected; (2) multilayer architectures can help capture the hierarchical patterns in the graph. However, CNNs can only operate on regular Euclidean data like text (1D sequence) and images (2D grid), and cannot be directly transferred to the graph structure. In this subsection, we introduce how GCNs extend the convolutional operation to deal with the non-Euclidean graph data.

Mainstream GCNs usually adopt semi-supervised settings for training, while previous graph embedding methods are mostly unsupervised or self-supervised. Here we only introduce the encoder architectures of GCNs and omit their loss functions, which depend on downstream tasks. Specifically, typical GCNs can be divided into spectral and spatial (nonspectral) approaches.

Spectral Methods

From the signal processing perspective, the convolutional operation first transforms a signal to the spectral domain, then modifies the signal with a filter, and finally projects the signal back to the original domain [63]. Spectral GCNs follow the same process and define the convolution operator in the spectral domain of graph signals.

Formally, d-dimensional input representations \(\mathbf {X}\in \mathbb {R}^{|\mathcal {V}|\times d}\) of a graph can be seen as d graph signals. Then, spectral GCNs are formulated as

$$\displaystyle \begin{aligned} \mathbf{H} = \mathcal{F}^{-1}(\mathcal{F}(\mathbf{g})\odot\mathcal{F}(\mathbf{X})), \end{aligned} $$
(6.21)

where H is the representations of all nodes, g is the filter in the spatial domain, \(\mathcal {F}(\cdot )\) and \(\mathcal {F}^{-1}(\cdot )\) indicate the graph Fourier transform (GFT) [9] and inverse GFT respectively, which can be defined as

$$\displaystyle \begin{aligned} \mathcal{F}(\mathbf{X}) = {\mathbf{U}}^{\top}\mathbf{X}, \mathcal{F}^{-1}(\mathbf{X}) = \mathbf{U}\mathbf{X}, \end{aligned} $$
(6.22)

where U is the eigenvector matrix of the normalized graph Laplacian \(\mathbf {L} = {\mathbf {I}}_{|\mathcal {V}|} - {\mathbf {D}}^{-\frac {1}{2}} \mathbf {A} {\mathbf {D}}^{-\frac {1}{2}}\), A is the adjacency matrix, and D is the degree matrix.

In practice, we can use a learnable diagonal matrix gθ to approximate the spectral graph filter \(\mathcal {F}(\mathbf {g})\), and the graph convolutional operation can be reformulated as

$$\displaystyle \begin{aligned} \mathbf{H} = \mathbf{U}{\mathbf{g}}_{\theta}{\mathbf{U}}^{\top}\mathbf{X}. \end{aligned} $$
(6.23)

Intuitively, initial graph signals X are transformed into spectral domain by multiplying U. Then filter gθ performs the convolution, and U projects graph signals back to their original space. The above form of graph convolution is used in spectral network [10], the first spectral GCN method.
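The following NumPy sketch implements Eq. (6.23) literally: it computes the normalized Laplacian, performs the GFT with its eigenvectors, applies a diagonal spectral filter, and transforms the signals back; g_theta is treated here as a given coefficient vector rather than a trained parameter.

```python
import numpy as np

def spectral_graph_conv(A: np.ndarray, X: np.ndarray, g_theta: np.ndarray) -> np.ndarray:
    """Spectral graph convolution H = U g_theta U^T X (Eq. 6.23).
    g_theta: length-|V| vector of filter coefficients (learnable in practice)."""
    deg = A.sum(axis=1)                                    # assumes no isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    eigvals, U = np.linalg.eigh(L)                         # graph Fourier basis
    X_hat = U.T @ X                    # GFT: project signals into the spectral domain
    return U @ (g_theta[:, None] * X_hat)   # filter, then inverse GFT
```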

However, the original form in Eq. (6.23) has several limitations: (1) the filter gθ is not directly related to the graph structure; (2) the kernel size of the graph filter grows with the number of nodes in the graph, which may cause inefficiency and overfitting issues; (3) the calculation of U relies on computationally inefficient matrix factorization. Now we introduce two typical spectral GCNs improving the original formats, including ChebNet [23] and GCN [52].

ChebNet

ChebNet [23] proposes to approximate \(\mathbf {U}{\mathbf {g}}_{\theta }{\mathbf {U}}^{\top }\) by Chebyshev polynomials Tk(⋅) up to the Kth order, which involve information within the K-hop neighborhood. In this way, ChebNet does not need to compute the matrix U, and the number of learnable parameters is related to K instead of \(|\mathcal {V}|\). Formally, the operation of ChebNet is reformulated as

$$\displaystyle \begin{aligned} \mathbf{H} = \sum_{k=0}^K {\theta}_k {\mathbf{T}}_k(\tilde{\mathbf{L}})\mathbf{X}, \end{aligned} $$
(6.24)

where \(\tilde {\mathbf {L}}={2}/{\lambda _{max}}\mathbf {L} - {\mathbf {I}}_{|\mathcal {V}|}\) is the normalized Laplacian matrix, λmax is the largest eigenvalue of L, and \(\theta \in \mathbb {R}^K\) is a weight vector indicating Chebyshev coefficients. The Chebyshev polynomials are defined as

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{T}}_0(\tilde{\mathbf{L}}) & =&\displaystyle {\mathbf{I}}_{|\mathcal{V}|}, \\ {\mathbf{T}}_1(\tilde{\mathbf{L}}) & =&\displaystyle \tilde{\mathbf{L}}, \\ {\mathbf{T}}_k(\tilde{\mathbf{L}}) & =&\displaystyle 2\tilde{\mathbf{L}}{\mathbf{T}}_{k-1}(\tilde{\mathbf{L}}) - {\mathbf{T}}_{k-2}(\tilde{\mathbf{L}}). \end{array} \end{aligned} $$
(6.25)

The Kth-order polynomial can be efficiently computed in a recursive manner, and the parameter number of the graph filter is reduced to K.
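A NumPy sketch of the ChebNet filtering operation in Eq. (6.24), using the recurrence in Eq. (6.25), is shown below; theta is assumed to be a given coefficient vector.

```python
import numpy as np

def chebnet_layer(L: np.ndarray, X: np.ndarray, theta: np.ndarray, lam_max: float) -> np.ndarray:
    """ChebNet filtering H = sum_k theta_k T_k(L_tilde) X (Eq. 6.24)."""
    n = L.shape[0]
    L_tilde = 2.0 / lam_max * L - np.eye(n)      # rescale eigenvalues into [-1, 1]
    T_prev, T_curr = np.eye(n), L_tilde
    H = theta[0] * (T_prev @ X)
    if len(theta) > 1:
        H = H + theta[1] * (T_curr @ X)
    for k in range(2, len(theta)):
        T_next = 2 * L_tilde @ T_curr - T_prev   # Chebyshev recurrence (Eq. 6.25)
        H = H + theta[k] * (T_next @ X)
        T_prev, T_curr = T_curr, T_next
    return H
```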

GCN

GCN [52] is a first-order approximation of ChebNet. GCN reveals that ChebNet may suffer from the overfitting problem when handling graphs with very wide node degree distributions. Hence, GCN limits the maximum order of Chebyshev polynomials to K = 1, and the equation is simplified to the following form with two trainable scalars \(\theta _0^{\prime }\) and \(\theta _1^{\prime }\):

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbf{H} & =&\displaystyle \theta_0^{\prime} \mathbf{X} + \theta_1^{\prime} \left(\mathbf{L}-{\mathbf{I}}_{|\mathcal{V}|}\right)\mathbf{X} \\ & =&\displaystyle \theta_0^{\prime} \mathbf{X} - \theta_1^{\prime} {\mathbf{D}}^{-\frac{1}{2}}\mathbf{A}{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X}. \end{array} \end{aligned} $$
(6.26)

GCN further reduces the number of parameters to address overfitting by setting \(\theta _0^{\prime }=-\theta _1^{\prime }=\theta \), and the equation is reformulated as

$$\displaystyle \begin{aligned} \mathbf{H} = \theta \left({\mathbf{I}}_{|\mathcal{V}|} + {\mathbf{D}}^{-\frac{1}{2}}\mathbf{A}{\mathbf{D}}^{-\frac{1}{2}}\right) \mathbf{X}. \end{aligned} $$
(6.27)
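The sketch below implements one propagation step following Eq. (6.27), with the scalar θ generalized to a weight matrix and a ReLU nonlinearity added, as is common in practice; the renormalization trick used in practical GCN implementations is omitted.

```python
import numpy as np

def gcn_layer(A: np.ndarray, X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN propagation step: ReLU((I + D^{-1/2} A D^{-1/2}) X W)."""
    deg = A.sum(axis=1)                                     # assumes no isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_hat = np.eye(A.shape[0]) + D_inv_sqrt @ A @ D_inv_sqrt
    return np.maximum(A_hat @ X @ W, 0.0)                   # ReLU nonlinearity
```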

To summarize, spectral GCNs can effectively capture complex global patterns in graphs with spectral graph filters. However, the learned filters of the spectral methods usually depend on the Laplacian eigenbasis of the graph structure. This leads to the problem that a spectral-based GCN model learned on a specific graph cannot be directly transferred to another graph.

Spatial Methods

Different from spectral GCNs, spatial GCNs define graph convolutional operation by directly aggregating information on spatially close neighbors, which is also known as the message-passing process. As shown in Fig. 6.6, the representation \({\mathbf {h}}^{k}_{v}\) of node v at k-th layer can be seen as a function aggregating the representations of v and its neighbors at (k − 1)-th layer:

$$\displaystyle \begin{aligned} {\mathbf{h}}^{k}_{v}=f(\{{\mathbf{h}}^{k-1}_v\} \cup \{{\mathbf{h}}_u^{k-1}| \forall u \in \mathcal{N}_v\}), \end{aligned} $$
(6.28)

where \(\mathcal {N}_v\) is the neighbor set of node v.

Fig. 6.6 Illustration of spatial GCNs. In the feed-forward computation, each node aggregates information from the representations of its neighbors and itself

The key challenge of spatial GCNs is how to define a convolutional operation that can handle nodes with different degrees while maintaining the local invariance of CNNs. In this subsection, we introduce two widely used spatial GCNs, Neural FPs and GraphSAGE.

Neural FPs

Neural FPs [27] propose to use different weight matrices for nodes with different-sized neighborhoods:

$$\displaystyle \begin{aligned} {\mathbf{z}}_v^{k} &= {\mathbf{h}}_{v}^{k-1} + \sum_{u\in \mathcal{N}_v} {\mathbf{h}}_u^{k-1},\\ {\mathbf{h}}^{k}_{v} &= \sigma ( {\mathbf{W}}_{|\mathcal{N}_v|}^{k}{\mathbf{z}}_v^{k}), \end{aligned} $$
(6.29)

where \({\mathbf {W}}_{|{\mathcal {N}}_v|}^{k}\) is the weight matrix for nodes with degree \(|\mathcal {N}_v|\) at layer k and σ(⋅) is a nonlinear function such as Sigmoid. Neural FPs require learning weight matrices for all node degrees in the graph. Hence, when applied to large-scale graphs with diverse node degrees, the method cannot capture the invariant information among different node degrees and needs more parameters as node degrees get larger.

GraphSAGE

GraphSAGE [37] transfers GCNs to handle the inductive setting, where the representations of new nodes should be computed without retraining. Instead of utilizing the full set of neighbors, GraphSAGE learns graph representations by uniformly sampling a fixed-size neighbor set from each node’s local neighborhood. GraphSAGE can be formulated as

$$\displaystyle \begin{aligned} {\mathbf{h}}_{{\mathcal{N}}_v}^{k} &= {\operatorname{Aggregate}} (\{{\mathbf{h}}_u^{k-1}| \forall u\in {\mathcal{N}}_v\} ), \\ {\mathbf{h}}_{v}^{k} &= \sigma ({\mathbf{W}}^{k} \operatorname{concat}({\mathbf{h}}_{v}^{k-1} ; {\mathbf{h}}_{\mathcal{N}_v}^{k} ) ), \end{aligned} $$
(6.30)

where \(\mathcal {N}_v\) is the sampled neighbor set of node v and the aggregator function \(\operatorname {Aggregate}(\cdot )\) usually takes one of the following three forms:

  1. Mean aggregator. By utilizing a mean-pooling aggregator, GraphSAGE can be viewed as the inductive version of the original transductive GCN framework [52] (a minimal code sketch of this aggregator follows this list), which can be formulated as

    $$\displaystyle \begin{aligned} {\mathbf{h}}^{k}_v = \sigma\left({\mathbf{W}}^k\cdot{\operatorname{Mean}}\left(\{{\mathbf{h}}^{k-1}_v\} \cup \{{\mathbf{h}}_u^{k-1}| \forall u \in {{\mathcal{N}}_v}\}\right)\right). \end{aligned} $$
    (6.31)
  2. Max-pooling aggregator. The max-pooling aggregator first feeds each neighbor’s hidden representation into a fully connected layer and then applies a max-pooling operation to the obtained representations of the node’s neighbors. It can be formulated as

    $$\displaystyle \begin{aligned} {\mathbf{h}}_{{\mathcal{N}}_v}^{k} = \operatorname{Max}(\{\sigma({\mathbf{W}}^k{\mathbf{h}}^{k-1}_{u} + {\mathbf{b}}^k)| \forall u \in {\mathcal{N}}_v\}). \end{aligned} $$
    (6.32)
  3. LSTM aggregator. GraphSAGE also proposes to use an LSTM-based aggregator with a stronger expressive capability. Since LSTMs process inputs sequentially, GraphSAGE randomly permutes node v’s neighbors to adapt LSTMs to the unordered neighbor set.
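Below is a minimal NumPy sketch of one GraphSAGE layer with uniform neighbor sampling and the mean aggregator of Eq. (6.31); the tanh nonlinearity and the neighbor-list format are illustrative assumptions.

```python
import numpy as np

def graphsage_mean_layer(H_prev, neighbors, W, sample_size=10, seed=0):
    """One GraphSAGE layer with a mean aggregator over sampled neighbors.
    H_prev: |V| x d_in node representations; neighbors[v]: list of v's neighbor indices."""
    rng = np.random.default_rng(seed)
    H_new = np.zeros((H_prev.shape[0], W.shape[1]))
    for v in range(H_prev.shape[0]):
        nbrs = neighbors[v]
        if len(nbrs) > sample_size:                          # fixed-size neighbor sampling
            nbrs = rng.choice(nbrs, size=sample_size, replace=False)
        pooled = np.mean(H_prev[[v] + list(nbrs)], axis=0)   # mean over {v} and N_v
        H_new[v] = np.tanh(pooled @ W)                       # nonlinearity (assumed tanh)
    return H_new
```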

In summary, GCNs extend the idea of CNNs into the graph domain, enabling models to capture local connectivity of the graph, and have shown their superior abilities in a wide range of downstream tasks compared to the previous autoencoder-based methods. Even now, we can still get state-of-the-art performance by equipping vanilla GCNs with proper training strategies [48] or knowledge distillation methods [123, 124].

6.4.3 Graph Attention Networks

The attention mechanism has shown its strong ability to consider instance importance in learning representations in many NLP applications, such as machine translation [3, 33, 105] and machine reading [18]. Hence, many works [106, 146] focus on generalizing the attention mechanism to the graph domain and hope graph neural networks (GNNs) with attention-based operators can achieve better results by considering the importance of neighbors. Figure 6.7 illustrates the architecture of graph attention networks (GATs).

Fig. 6.7 Illustration of GATs. Compared with spatial GCNs, GATs assign different aggregation weights to different neighbors, and employ multiple parallel attention heads for computation

GAT

GAT [106] proposes to adopt the self-attention mechanism for the information aggregation of GNNs. Specifically, each node’s representation is calculated by attending to its neighbors:

$$\displaystyle \begin{aligned} {\mathbf{h}}^{k}_v = \sigma\left(\sum_{u\in {\mathcal{N}}_v\cup\{v\} } \alpha_{vu}^k {\mathbf{W}}^k{\mathbf{h}}_u^{k-1}\right), \end{aligned} $$
(6.33)

where σ(⋅) is a nonlinear function and \(\alpha _{vu}^k\) is the attention coefficient of node pair (v, u) at the k-th layer, which is normalized over node v’s neighbors:

$$\displaystyle \begin{aligned} \alpha_{vu}^k = \frac{\exp\left(\text{LeakyReLU}\left({\mathbf{a}}^{\top}\operatorname{concat}({\mathbf{W}}^k{\mathbf{h}}_v^{k-1};{\mathbf{W}}^k{\mathbf{h}}_u^{k-1})\right)\right)}{\sum_{m\in{\mathcal{N}}_v\cup\{v\}} \exp\left(\text{LeakyReLU}\left({\mathbf{a}}^{\top}\operatorname{concat}({\mathbf{W}}^k{\mathbf{h}}_v^{k-1};{\mathbf{W}}^k{\mathbf{h}}_m^{k-1})\right)\right)}, \end{aligned} $$
(6.34)

where Wk is the weight matrix of a shared linear transformation applied to every node, a is a learnable weight vector and LeakyReLU(⋅) is a nonlinear function.

Moreover, GAT utilizes the multi-head attention mechanism similar to [105] to further aggregate different types of information. Specifically, GAT concatenates (or averages) the output representations of M independent attention heads:

$$\displaystyle \begin{aligned} {\mathbf{h}}^{k}_v = \|{}_{m=1}^M \sigma\left(\sum_{u\in{\mathcal{N}}_v\cup\{v\}}\alpha_{vu}^{k,m}{\mathbf{W}}_m^k {\mathbf{h}}_u^{k-1}\right), \end{aligned} $$
(6.35)

where \(\alpha _{vu}^{k,m}\) is the attention coefficient from the m-th attention head at the k-th layer, \({\mathbf {W}}_m^k\) is the transform matrix for the m-th attention head, and ∥ is the concatenation operation.
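A single-head NumPy sketch of the GAT layer in Eqs. (6.33) and (6.34) is given below; the tanh output nonlinearity and the LeakyReLU slope of 0.2 are assumptions of this illustration.

```python
import numpy as np

def gat_layer(A, H_prev, W, a):
    """Single-head GAT layer. W: (d_in, d_out) transform; a: (2 * d_out,) attention vector."""
    n = A.shape[0]
    Z = H_prev @ W                                   # shared linear transformation
    H_new = np.zeros_like(Z)
    for v in range(n):
        nbrs = [u for u in range(n) if A[v, u] > 0 or u == v]   # N_v and v itself
        # Unnormalized attention logits with LeakyReLU (slope 0.2 assumed).
        logits = np.array([np.concatenate([Z[v], Z[u]]) @ a for u in nbrs])
        logits = np.where(logits > 0, logits, 0.2 * logits)
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()                                    # softmax over neighbors
        H_new[v] = np.tanh(sum(alpha[i] * Z[u] for i, u in enumerate(nbrs)))
    return H_new
```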

To summarize, by incorporating the attention mechanism into the information aggregating phase of spatial GCNs, GATs can assign proper weights to different neighbors and offer better interpretability. The ensemble of multiple attention heads further increases the model capacity and brings performance gains over GCNs.

6.4.4 Graph Recurrent Networks

To capture the dependency between two distant nodes by a graph encoder, one has to stack many GNN layers so that the information can propagate from one to another. However, stacking too many GNN layers in the feed-forward computation will cause the over-smoothing issue, which makes node representations less discriminative and harms the performance. Inspired by the success of gated recurrent unit (GRU) [19] and LSTM [42] in modeling long-term dependency in NLP, graph recurrent networks (GRNs) propose to equip the information aggregation of GCNs with gate mechanisms. Similar to the usage in RNNs, the gate mechanisms allow the information to propagate farther without severe gradient vanishment or over-smoothing issues. In this way, GRNs can improve the model’s ability in capturing the long-range dependency across the graph. In this subsection, we introduce several variants of GRNs, including GGNN, Tree-LSTM, and Graph LSTM.

Gated Graph Neural Network (GGNN)

GGNN [57] introduces a GRU-like function to improve the information propagation of the vanilla GCN architecture. In each layer, GGNN updates the representations of nodes by combining the information of their neighbors and themselves with the update and reset gates. Specifically, the recurrence of each layer in GGNN is defined as

$$\displaystyle \begin{aligned} {\mathbf{a}}_v^{k} &=\sum_{u \in \mathcal{N}_v} {\mathbf{h}}_u^{k-1}+ \mathbf{b}, \notag \\ {\mathbf{z}}_v^{k} &= \operatorname{Sigmoid}\left({\mathbf{W}}_z{{\mathbf{a}}_v^{k}}+{\mathbf{U}}_z{\mathbf{h}}_v^{k-1}\right), \notag \\ {\mathbf{r}}_v^{k} &= \operatorname{Sigmoid}\left({\mathbf{W}}_r{\mathbf{a}}_v^{k}+{\mathbf{U}}_r{\mathbf{h}}_v^{k-1}\right), \\ \widetilde{\mathbf{h}}_v^{k} &= \tanh\left({\mathbf{W}}{{\mathbf{a}}_v^{k}}+{\mathbf{U}}({{\mathbf{r}}_v^{k}}\odot{{\mathbf{h}}_v^{k-1}})\right) \notag ,\\ {\mathbf{h}}_v^{k} &= (1-{{\mathbf{z}}_v^{k}}) \odot{{\mathbf{h}}_v^{k-1}}+{{\mathbf{z}}_v^{k}}\odot{\widetilde{{\mathbf{h}}}_v^{k}}, \notag \end{aligned} $$
(6.36)

where \({\mathbf {a}}_{v}^k\) represents node v’s neighborhood information, b is the bias vector, \(\widetilde {\mathbf {h}}_v^{k}\) is the candidate representation, z and r are the update and reset gates, and W and U are weight matrices.

Typical RNNs can also be seen as a special case of GGNN, where the graph is a chain structure. Hence, GRNs like GGNN have been widely used in language models. Note that tree structure is very popular in text data, such as the dependency parsing tree. Next, we introduce Tree-LSTM [101], which extends GRNs to model the tree structure.

Tree-LSTM

Tree-LSTM [101] uses an LSTM-based unit with input/output gates iv/ov and memory cell cv to update representation hv of each tree node v. There are two variants of Tree-LSTM: Child-sum Tree-LSTM and N-ary Tree-LSTM. Instead of using a unified forget gate like LSTM, Child-sum Tree-LSTM assigns a forget gate fvm for each child m of node v, which allows tree node v to adaptively gather information from its children. N-ary Tree-LSTM requires each node to have at most N children, and assigns different learnable parameters for each child. Child-sum Tree-LSTM is suitable for trees whose children are unordered, and thus can be used for modeling dependency trees. N-ary Tree-LSTM can characterize the diverse relational information for each node’s children, and thus is usually used to model constituency parsing tree structure. Here we only present the formulas of Child-sum Tree-LSTM:

$$\displaystyle \begin{aligned} \widetilde{\mathbf{h}}_v^{k-1} &= \sum_{m \in {\mathcal{N}}_v^c} {\mathbf{h}}_m^{k-1} \notag ,\\ {\mathbf{i}}_v^{k} &= \operatorname{Sigmoid} \Big( {\mathbf{W}}_i {\mathbf{x}}_v + {\mathbf{U}}_i \widetilde{\mathbf{h}}_v^{k-1} + {\mathbf{b}}_i \Big) \notag ,\\ {\mathbf{f}}_{vm}^k &= \operatorname{Sigmoid} \Big( {\mathbf{W}}_f {\mathbf{x}}_v + {\mathbf{U}}_f {\mathbf{h}}_m^{k-1} + {\mathbf{b}}_f \Big) \notag ,\\ {\mathbf{o}}_v^{k} &= \operatorname{Sigmoid} \Big( {\mathbf{W}}_o {\mathbf{x}}_v + {\mathbf{U}}_o \widetilde{{\mathbf{h}}}_v^{k-1} + {\mathbf{b}}_o \Big) ,\\ {\mathbf{u}}_v^k &= \tanh \Big({\mathbf{W}}_u {\mathbf{x}}_v + {\mathbf{U}}_u \widetilde{{\mathbf{h}}}_v^{k-1} + {\mathbf{b}}_u \Big) \notag ,\\ {\mathbf{c}}_v^k &= {\mathbf{i}}_v^k \odot {\mathbf{u}}_v^k + \sum_{m \in {\mathcal{N}}_v^c} {\mathbf{f}}_{vm}^k \odot {\mathbf{c}}_m^{k-1} \notag ,\\ {\mathbf{h}}_v^k &= {\mathbf{o}}_v^k \odot \tanh({\mathbf{c}}_v^k) \notag , \end{aligned} $$
(6.37)

where \({\mathcal {N}}_v^c\) is the children set of node v and xv is the input representation for tree node v. Readers can refer to the original paper [101] for the details of N-ary Tree-LSTM variant.

Graph LSTM

Graph LSTM [78, 144] proposes to adapt Tree-LSTM to model the graph structure, and utilizes different weight matrices to represent different labels on the edges. Formally, assume that the edge label between node v and its child m is l. Compared with Eq. (6.37) in Tree-LSTM, Graph LSTM uses label-specific weight matrix Ul to compute relevant gates and hidden states.

In summary, GRNs with gate mechanisms can effectively model the long-range dependencies between distant nodes, which is very important in text modeling. By adapting to tree-structure data, GRNs can also handle the diverse syntactic and semantic relations in text.

6.4.5 Graph Transformers

Transformer [105] has set off a craze in both the NLP and CV areas [26, 84]. Based on the powerful self-attention architecture, graph Transformer networks [53, 87, 138] have been proposed to improve the expressive ability of GNNs. The core idea is to leverage the Transformer architecture to capture long-range relationships between distant nodes. Compared with GRNs, graph Transformers can benefit from the advantages of Transformers over LSTMs.

Connections with Transformers in Text Modeling

In graph Transformers, nodes rather than words are the basic units, and the self-attention mechanism is performed between all node pairs. In this way, all nodes are directly connected regardless of the original graph structure. To utilize the original topology information, current graph Transformers focus on modifications of the input features and attention coefficients. Other operations, including multi-head ensembling, the feed-forward network, and layer normalization, remain unchanged. Now we introduce three typical graph Transformers, including Graphormer [138], GraphTrans [116], and SAT [15].

Graphormer

Graphormer [138] designs three structural encoding modules to inject graph structure information into the Transformer architecture. Specifically, centrality encoding adds node degrees to the input, which can indicate the importance of different nodes:

$$\displaystyle \begin{aligned} {\mathbf{h}}^{0}_v = {\mathbf{x}}_v + {\mathbf{z}}^-_{\text{deg}^-(v)} + {\mathbf{z}}^+_{\text{deg}^+(v)}, \end{aligned} $$
(6.38)

where xv is the feature vector of node v, and \({\mathbf{z}}^-\) and \({\mathbf{z}}^+\) are learnable embedding vectors indexed by node in-degree and out-degree, respectively.

Spatial encoding of node distances and edge encoding of edge features serve as bias terms for the attention coefficients in the self-attention layer:

$$\displaystyle \begin{aligned} \alpha_{vu}^k= ({\mathbf{W}}_Q{\mathbf{h}}_v^{k-1})^{\top}({\mathbf{W}}_K{\mathbf{h}}_u^{k-1})/\sqrt {d}+ {b}_{\phi(v,u)}+c_{vu}, \end{aligned} $$
(6.39)

where \(\alpha _{vu}^k\) is the attention coefficient between node v and u in the self-attention layer, WQ, WK are weight matrices, d is the hidden dimension, bϕ(v,u) is the learnable scalar indexed by the distance ϕ(v, u) between v and u, and cvu is the scalar derived from the edge features between v and u.
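The following sketch computes the Graphormer attention coefficients of Eq. (6.39) given precomputed distance and edge-bias tables; the table formats and names are hypothetical.

```python
import numpy as np

def graphormer_attention_scores(H, W_Q, W_K, dist, b_table, c_edge):
    """Attention coefficients of Eq. (6.39): scaled dot product plus a learnable
    distance bias b_table[dist] and an edge-feature bias c_edge (both assumed given)."""
    d = W_Q.shape[1]
    Q, K = H @ W_Q, H @ W_K
    scores = Q @ K.T / np.sqrt(d)          # (|V| x |V|) dot-product term
    scores = scores + b_table[dist]        # spatial encoding: bias indexed by distance
    scores = scores + c_edge               # edge encoding bias
    # Row-wise softmax to obtain the final attention weights.
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```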

To take advantage of deep models in utilizing structure information, GraphTrans [116] and SAT [15] propose to integrate GNN and Transformer architectures for modeling.

GraphTrans

GraphTrans [116] directly stacks a Transformer module on the top of a GNN module. In other words, the output node representations of the GNN are used as the input features of the Transformer. GraphTrans adopts a special [CLS] token which connects to all other nodes, and the representation of [CLS] token after the Transformer is taken as the graph representation. Hence, the Transformer can also be seen as a pooling operator for GNN. As a result, GraphTrans can capture both the local structured information and long-range relationships on graphs at the same time.

SAT

SAT [15] proves that simply modifying the position encoding module of standard Transformers cannot fully capture the structural similarity between nodes on a graph. To address this problem, SAT defines the attention coefficients in the Transformer by the similarity of GNN-based representations. In this way, the GNN serves as a structure extractor that integrates more information into the self-attention layers.

In summary, graph Transformer architecture can model long-range relationships on a graph and go beyond the limitations of traditional deep GNNs, such as over-smoothing. Compared with GNNs limited by the Weisfeiler-Lehman test [119], the Transformer-based methods become more expressive. Besides, GNNs can also be integrated into graph Transformers to better utilize the graph topology information.

6.4.6 Extensions

In this subsection, we will talk about several typical extensions of GNNs, including skip connection, neighborhood sampling, and the modeling of diverse graph types, which can respectively improve the effectiveness, efficiency, and generalizability of GNNs.

GNNs with Skip Connection

Theoretically, we can enhance the expressive ability of GNNs by stacking more layers. However, existing works find that deeper GNNs do not perform better in downstream tasks and even perform worse [52]. Chen et al. [14] further attribute this phenomenon to the low information-to-noise ratio received by the nodes in deep GNNs. The residual network [40], which has been verified in the computer vision community, is a straightforward solution to the problem. But researchers find that deep GNNs with residual connections still perform worse than two-layer GNNs. Therefore, Rahimi et al. [85] and Xu et al. [120] further explore how to enhance the performance of GNNs with skip connections. Inspired by the idea of the highway network [160], Rahimi et al. [85] employ a layer-wise gate mechanism, with which the performance peaks at four layers. Formally, the Highway GCN can be defined as

$$\displaystyle \begin{aligned} \mathbf{T}({\mathbf{h}}^{k-1}_v) &= \operatorname{Sigmoid} \left( {\mathbf{W}}^{k-1} {\mathbf{h}}^{k-1}_v + {\mathbf{b}}^{k-1} \right),\\ {\mathbf{h}}^{k}_v &= {\mathbf{h}}^{k}_v \odot \mathbf{T}({\mathbf{h}}^{k-1}_v) + {\mathbf{h}}^{k-1}_v \odot (1 - \mathbf{T}({\mathbf{h}}^{k-1}_v)). \end{aligned} $$
(6.40)

Besides, Xu et al. [120] present the jump knowledge network, which selects representations from all intermediate layers as final node representations. The selecting mechanism enables the jump knowledge network to adaptively pick the reasonable neighborhood information for each node.

GNNs with Neighborhood Sampling

The vanilla GNN [89] has several limitations: (1) the computation is based on the entire graph Laplacian matrix, and thus computationally expensive for large graphs; (2) it is trained and specialized for a given graph, and cannot be transferred to another graph. To address the problems, the aforementioned GraphSAGE [37] first samples neighborhood nodes of a target node, and then aggregates the representations of all sampled nodes. Thus, GraphSAGE can get rid of the graph Laplacian and can be applied to unseen nodes. Ying et al. [139] further propose the importance-based sampling method PinSage, which simulates random walks starting from target nodes and samples the neighborhood set according to the normalized visit counts. Instead of sampling neighbors for each node, Chen et al. [16] propose FastGCN, which directly samples the receptive field for each layer. In other words, only sampled nodes can participate in layer-wise information propagation. FastGCN sets the sampling importance of nodes according to their degrees, and tends to keep the nodes with larger degrees. Besides, Huang et al. [45] introduce a parameterized and trainable sampler, which performs layer-wise sampling based on the previous layer.

GNNs for Graphs of Diverse Types

The vanilla GNN [89] is designed for undirected graphs, the simplest graph format. However, as shown in Fig. 6.8, we have graphs of diverse types in real-world scenarios. Now we will introduce several extensions of GNNs to deal with directed, heterogeneous, or dynamic graphs:

Fig. 6.8 Diverse types of graphs

Directed Graphs

Directed edges contain extra information compared to undirected ones. For example, a famous person in a social network may be followed by lots of other users, and her influence on her followers is usually stronger than theirs on her. This suggests that GNNs should treat the two directions of information propagation differently. DGP [49] defines two graphs where nodes are respectively connected to all their ancestors or descendants. We correspondingly denote the normalized adjacency matrices as \({\mathbf {D}}^{-1}_p{\mathbf {A}}_p\) and \({\mathbf {D}}^{-1}_c{\mathbf {A}}_c\). The encoder of DGP can be formulated as

$$\displaystyle \begin{aligned} {\mathbf{H}}^{k} = \sigma({\mathbf{D}}^{-1}_p{\mathbf{A}}_p \sigma({\mathbf{D}}^{-1}_c{\mathbf{A}}_c{\mathbf{H}}^{k-1}{\mathbf{W}}_c){\mathbf{W}}_p), \end{aligned} $$
(6.41)

where Hk is the representation matrix of all nodes in the k-th layer, Wp and Wc are weight matrices, and σ(⋅) is a nonlinear function. In this way, the information propagations of two directions are processed separately.

Heterogeneous Graphs

A heterogeneous graph [92] has several kinds of nodes. For example, for the graph in the shopping recommendation system, we have user nodes, item nodes, etc. The simplest way to process heterogeneous graphs is to consider the node type information in their input features, i.e., convert the node type information into a one-hot feature vector and concatenate it to the original features. Besides the simple feature-based approach, GraphInception [150] introduces the concept of meta-path into the information propagation on heterogeneous graphs. GraphInception utilizes GNNs to perform information propagation on homogeneous subgraphs, which are extracted based on human-designed meta-paths. At last, it concatenates the outputs from different homogeneous GNNs as final node representations. Wang et al. [111] further propose the heterogeneous graph attention network (HAN) with node-level and meta-path-level attention mechanisms. In addition, some works [112, 154] also consider the modeling of network schema [99], which is a meta template of a heterogeneous graph indicating node types and their relations.

Besides, many graphs contain edges with weights or types as additional information. A typical way to handle such graphs is to build a bipartite graph, where the original edges are converted into nodes linking to their original endpoint nodes, and the edge type information is thus converted into node type information. Another way is to assign different propagation weight matrices to different edge types. However, if a graph has many edge types, the number of parameters will be large. To address the problem, R-GCN [90] introduces two regularization tricks that decompose the transformation matrix \({\mathbf{W}}_r\) of edge type r into a set of basis transformations shared among all edge types. Thus, R-GCN can reduce the number of parameters and capture the relationship between different edge types.
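The basis-decomposition trick can be sketched in a few lines of Python: every relation-specific transformation is a learned linear combination of a small set of shared basis matrices, so the number of parameters grows with the number of bases rather than the number of edge types. The shapes and variable names below are assumptions for illustration only.

import numpy as np

def rgcn_basis_weights(V, a):
    """Compose relation-specific weights W_r = sum_b a[r, b] * V[b] from shared bases.

    V: (num_bases, d_in, d_out) shared basis matrices.
    a: (num_relations, num_bases) relation-specific coefficients.
    Returns an array of shape (num_relations, d_in, d_out).
    """
    return np.einsum('rb,bio->rio', a, V)

# Example: 10 edge types sharing 3 bases of shape (16, 16).
rng = np.random.default_rng(0)
W = rgcn_basis_weights(rng.normal(size=(3, 16, 16)), rng.normal(size=(10, 3)))
# Propagation for edge type r would then use A_r @ H @ W[r].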

Dynamic Graphs

Graphs are usually dynamic and vary over time. For example, a user in a social network may start following another user. To model graph structures that change over time, DCRNN [56] and STGCN [143] first capture the static graph information at each time step by GNNs and then feed the output representations into a sequence model such as an RNN. In addition, Structural-RNN [47] and ST-GCN [122] extend the static graph structure with temporal connections and apply conventional GNNs on the extended graphs, which can capture structural and temporal information at the same time. MetaDyGNN [131] combines GNNs with meta-learning for few-shot link prediction in dynamic graphs.
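The snapshot-based idea behind models such as DCRNN and STGCN (a GNN per time step followed by a sequence model) can be illustrated with the following PyTorch sketch; it is a toy GCN-then-GRU pipeline under assumed input shapes, not the architecture of any specific paper.

import torch
import torch.nn as nn

class GCNThenGRU(nn.Module):
    """Toy snapshot-based dynamic graph model: one GCN layer per time step, then a GRU over time."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, adjs, feats):
        # adjs: list of (N, N) normalized adjacency matrices, one per time step
        # feats: list of (N, in_dim) node feature matrices, one per time step
        states = [torch.relu(a @ self.lin(x)) for a, x in zip(adjs, feats)]
        seq = torch.stack(states, dim=1)  # (N, T, hid_dim): one sequence per node
        out, _ = self.gru(seq)            # temporal modeling of each node's trajectory
        return out[:, -1]                 # final-step representation for each node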

6.5 From Node Representation to Graph Representation

In the previous sections, we have introduced how to represent nodes in a graph, from shallow node representations to deep node representations. In many scenarios, we also need to compute the representation of an entire graph or a specific subgraph. Inspired by the pooling operations in the NLP and CV areas, graph pooling is designed to obtain the graph representation from node representations. Here we present two typical groups of graph pooling methods, namely, flat pooling and hierarchical pooling.

6.5.1 Flat Pooling

Flat pooling assumes a flat graph structure and generates the graph representation directly from all node representations. It includes max/mean/sum pooling as simple node pooling methods and SortPooling [147], which considers a node ordering based on structural roles.

Simple Node Pooling

Similar to the pooling operation in NLP and CV, we can directly apply node-wise max/mean/sum operators on top of node representations for graph pooling. The graph representation can be formulated as

$$\displaystyle \begin{aligned} \left\{ \begin{aligned} \max_{v\in \mathcal{V}} \mathbf{v}, \quad \quad &\text{(Graph max-pooling)}\\ \frac{1}{|\mathcal{V}|}\sum_{v\in \mathcal{V}} \mathbf{v}, \quad \quad &\text{(Graph mean-pooling)}\\ \sum_{v\in \mathcal{V}} \mathbf{v}. \quad \quad &\text{(Graph sum-pooling)} \end{aligned} \right. \end{aligned} $$
(6.42)

The above pooling operators are general and parameter-free, but completely neglect the graph structure.
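The three operators in Eq. (6.42) can be written directly on top of a node representation matrix; the following Python sketch (with an assumed (num_nodes, dim) input) makes the parameter-free nature of flat pooling explicit.

import numpy as np

def flat_pool(H, mode="mean"):
    """Flat graph pooling (Eq. 6.42) over node representations H of shape (num_nodes, dim)."""
    if mode == "max":
        return H.max(axis=0)    # graph max-pooling
    if mode == "mean":
        return H.mean(axis=0)   # graph mean-pooling
    if mode == "sum":
        return H.sum(axis=0)    # graph sum-pooling
    raise ValueError(f"unknown pooling mode: {mode}")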

SortPooling

SortPooling [147] first sorts node representations by their structural roles, which enables a consistent node ordering across different graphs and makes it possible to train typical neural networks on the sorted node representations for pooling. In particular, SortPooling feeds the sorted representations into a 1-D CNN to obtain the graph representation, enabling the pooling operation to keep more information about the global graph topology.
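The sorting-and-truncation step of SortPooling can be sketched as follows; for simplicity, the sketch sorts nodes by the last feature channel (standing in for the structural-role ordering in the paper) and pads or truncates to a fixed size k so that a standard 1-D CNN can be applied afterward.

import numpy as np

def sort_pool(H, k):
    """Sort node representations and keep a fixed number of rows as the 1-D CNN input."""
    order = np.argsort(-H[:, -1])  # descending order on the last channel (proxy for structural role)
    Hs = H[order][:k]
    if Hs.shape[0] < k:            # pad small graphs with all-zero rows
        Hs = np.vstack([Hs, np.zeros((k - Hs.shape[0], H.shape[1]))])
    return Hs                      # fixed-size (k, dim) tensor for the downstream 1-D CNN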

Though simple and effective, flat pooling ignores the hierarchical structure of a graph (e.g., nodes, subgraphs, and communities), thus leading to suboptimal graph representations.

6.5.2 Hierarchical Pooling

The basic idea of hierarchical pooling is to recursively group structurally related nodes into clusters to form coarsened graphs and to obtain the graph representation layer by layer. Next, we will introduce two typical hierarchical pooling operations.

DiffPool

DiffPool [140] proposes to learn hierarchical representations on top of node representations and can be combined with various node representation learning methods in an end-to-end fashion. As shown in Fig. 6.9, DiffPool learns a differentiable soft cluster assignment for each node and then maps nodes to a set of clusters layer by layer.

Fig. 6.9 The illustration of DiffPool: the original graph is coarsened into pooled graphs level by level. The figure is redrawn from Figure 1 of [140]

Formally, let \({\mathbf{S}}^{k} \in \mathbb{R}^{C_k\times C_{k+1}}\) denote the learned cluster assignment matrix at the k-th layer, where \({\mathbf{S}}^{k}_{vc}\) indicates the degree to which node v belongs to cluster c at the k-th layer, and \(C_k\) is the number of clusters at the k-th layer. With the cluster assignment matrix \({\mathbf{S}}^{k}\), we can then calculate the adjacency matrix \({\mathbf{A}}^{k+1}\in \mathbb{R}^{C_{k+1}\times C_{k+1}}\) of the next layer from the connectivity strength between the learned clusters:

$$\displaystyle \begin{aligned} {\mathbf{A}}^{k+1} = {{\mathbf{S}}^{k}}^{\top} {\mathbf{A}}^{k}{\mathbf{S}}^{k}. \end{aligned} $$
(6.43)

Then the output node representations \({\mathbf{H}}^{k+1}\) are computed by a GNN encoder:

$$\displaystyle \begin{aligned} {\mathbf{H}}^{k+1} = \operatorname{GNN}({\mathbf{A}}^{k+1},{\mathbf{X}}^{k+1}), \end{aligned} $$
(6.44)

where the input node representations \({\mathbf{X}}^{k+1}\) are obtained by aggregating the output representations from the k-th layer:

$$\displaystyle \begin{aligned} {\mathbf{X}}^{k+1} = {{\mathbf{S}}^{k}}^{\top} {\mathbf{H}}^{k}. \end{aligned} $$
(6.45)

DiffPool predefines the number of clusters at each layer and applies another GNN to the coarsened adjacency matrix \({\mathbf{A}}^{k}\) to generate the soft cluster assignment matrix \({\mathbf{S}}^{k}\). Finally, DiffPool feeds the top-layer graph representation into a downstream task classifier, so that the cluster assignment matrices are supervised by the downstream task.
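Equations (6.43)-(6.45) can be combined into a single coarsening step; the Python sketch below uses one-layer GCN-style propagation in place of the GNNs in the original paper, and the weight matrices W_assign and W_embed are illustrative parameters.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diffpool_step(A, H, W_assign, W_embed):
    """One DiffPool coarsening step (Eqs. 6.43-6.45) with single-layer GNNs for simplicity."""
    S = softmax(A @ H @ W_assign, axis=1)        # soft cluster assignment, shape (C_k, C_{k+1})
    X_next = S.T @ H                             # Eq. (6.45): pooled input features
    A_next = S.T @ A @ S                         # Eq. (6.43): coarsened adjacency matrix
    H_next = np.tanh(A_next @ X_next @ W_embed)  # Eq. (6.44): GNN on the coarsened graph
    return A_next, H_next, S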

gPool

As shown in Fig. 6.10, gPool [32] presents both graph pooling (gPool) and unpooling (gUnpool) operations, based on which the graph data is modeled by an encoder-decoder architecture. The encoder and decoder contain the same number of blocks, and each block consists of a GCN layer and a gPool or gUnpool operator. The representations after the last decoder block are used as the final representations for downstream tasks.

Fig. 6.10 Illustration of gPool and gUnpool. The figure is redrawn from Figure 2 of [32]

The gPool operation learns projection scores for each node with a learnable projection vector, and selects nodes with the highest scores as important ones to feed to the next layer:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{s}}^k & = {\mathbf{X}}^{k}{\mathbf{p}}^k/||{\mathbf{p}}^k||, \\ idx^k & = \text{rank}({\mathbf{s}}^k, n^k), \end{array} \end{aligned} $$
(6.46)

where \({\mathbf{s}}^k\) is the importance score vector; \({\mathbf{p}}^k\) and \({\mathbf{X}}^{k}\) are the projection vector and the input feature matrix of the k-th layer, respectively; and \(\text{rank}({\mathbf{s}}^k, n^k)\) returns the indices of the elements with the top-\(n^k\) scores in \({\mathbf{s}}^k\).

Then in each layer, we construct the adjacency matrix and the input feature matrix of the next layer from the rows and columns indexed by \(idx^k\):

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{A}}^{k+1} & =&\displaystyle {\mathbf{A}}^k(idx^k, idx^k), \\ \hat{\mathbf{s}}^k & =&\displaystyle \text{Sigmoid}({\mathbf{s}}^k(idx^k)), \\ \hat{\mathbf{X}}^k & =&\displaystyle {\mathbf{X}}^k(idx^k,:), \\ {\mathbf{X}}^{k+1} & =&\displaystyle \hat{\mathbf{X}}^k \odot (\hat{\mathbf{s}}^k{\mathbf{1}}_d^{\top}), \end{array} \end{aligned} $$
(6.47)

where \({\mathbf{1}}_d\) is a d-dimensional all-one vector and ⊙ denotes element-wise matrix multiplication. Here the normalized scores \(\hat{\mathbf{s}}^k\) act as weighted masks that further gate the feature matrix \(\hat{\mathbf{X}}^k\).

The gUnpool operator performs the inverse of the gPool operation, restoring the graph to its original structure. Specifically, gUnpool records the indices of the selected nodes at the corresponding pooling level and then simply places the nodes back in their original positions:

$$\displaystyle \begin{aligned} {\mathbf{X}}^{k-1} = \text{distribute}({\mathbf{0}}_{n^{k-1}\times d}, {\mathbf{X}}^k, idx^k), \end{aligned} $$
(6.48)

where \(idx^k\) is the indices of the \(n^k\) selected nodes at the corresponding gPool level, and the distribute function places the row vectors of \({\mathbf{X}}^k\) into an \(n^{k-1}\times d\) all-zero feature matrix according to the indices \(idx^k\).

Compared to DiffPool, gPool can reduce the storage complexity by replacing the cluster assignment matrix with a projection vector at each layer.
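A compact Python sketch of the gPool operator in Eqs. (6.46) and (6.47) is given below; the projection vector p and the number of kept nodes n_keep are assumed inputs, and learning them end to end is omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gpool(A, X, p, n_keep):
    """gPool (Eqs. 6.46-6.47): score nodes, keep the top-n_keep, and gate their features."""
    s = X @ p / np.linalg.norm(p)               # projection scores, shape (N,)
    idx = np.argsort(-s)[:n_keep]               # indices of the highest-scoring nodes
    A_next = A[np.ix_(idx, idx)]                # adjacency of the induced subgraph
    X_next = X[idx] * sigmoid(s[idx])[:, None]  # gate selected features by normalized scores
    return A_next, X_next, idx                  # idx is recorded for the paired gUnpool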

In this section, we introduce how to obtain the global graph representation from local node representations with graph pooling operations, which are widely used in a series of graph-level tasks such as graph classification and interaction prediction. When applying GNNs to model text, graph pooling can effectively help us extract sentence-level [72, 152] or document-level [23, 77] information for downstream tasks.

6.6 Self-Supervised Graph Representation Learning

Self-supervised learning methods, which have recently achieved immense success in the CV and NLP areas, can learn effective representations with well-designed pre-training tasks instead of expensive downstream task labels. Specifically, self-supervised graph representation learning [60] first learns node and graph representations via graph-based pre-training tasks that require no human supervision, such as graph structure reconstruction or pseudo-label prediction. Then, the learned models or representations can be used in downstream tasks (e.g., graph/node classification). As shown in Figs. 6.11 and 6.12, we will introduce three types of self-supervised graph representation learning methods, namely, generative, predictive, and contrastive methods, distinguished by their pre-training tasks.

Fig. 6.11 Illustration of generative (left) and predictive (right) methods: the encoder maps the input graph to representations, and the decoder reconstructs the input graph (generative) or predicts auxiliary information (predictive)

Fig. 6.12 Illustration of contrastive methods: two views of the input graph are encoded into representations and compared by a contrastive decoder

Generative Methods

The generative methods aim to reconstruct and predict some components (e.g., graphs, edges, node features) of input data.

Structure Reconstruction

These works aim to recover the adjacency matrix or masked edges of a graph. Graph autoencoder (GAE) [51] learns node representations by a two-layer graph convolutional network and then reconstructs the adjacency matrix of the input graph, and variational graph autoencoder (VGAE) [51] is a latent variable variant of GAE. ARGA and ARVGA [75] are adversarial variants of GAE and VGAE, respectively, combining autoencoder and adversarial approaches. AGE [21] adaptively defines the reconstruction objective of adjacency matrices in an iterative manner.
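For GAE-style structure reconstruction, the decoder is simply an inner product between node representations followed by a sigmoid; the sketch below shows this decoder together with a binary cross-entropy reconstruction loss, which is a simplified stand-in for the full (variational or adversarial) objectives of the cited models.

import numpy as np

def gae_reconstruct(Z):
    """Inner-product decoder: predicted edge probabilities sigmoid(Z Z^T)."""
    logits = Z @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))

def reconstruction_loss(A, Z, eps=1e-9):
    """Binary cross-entropy between the observed adjacency A and its reconstruction."""
    P = gae_reconstruct(Z)
    return -np.mean(A * np.log(P + eps) + (1 - A) * np.log(1 - P + eps))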

Feature Reconstruction

These works aim to recover the node attributes of a graph, i.e., the input features of GNNs. MGAE [108] takes both the corrupted node content and the network structure as input and predicts the original node features. GALA [76] proposes a symmetric graph convolutional autoencoder based on Laplacian sharpening and smoothing to learn node representations by predicting input features. Here the Laplacian sharpening encourages the reconstructed feature of a node to be far away from those of its neighbors. GPT-GNN [44] uses both graph and feature generation pre-training tasks to model the structural and semantic information of the graph.

Predictive Methods

Predictive methods learn informative representations with self-supervised signals from some auxiliary information, such as pseudo labels and graph properties. Depending on the prediction targets, the predictive methods can be further divided into three types.

Property Prediction

These methods manually define some high-level information of the graph for prediction. GROVER [87] uses a graph Transformer as the encoder and employs both contextual property prediction and graph-level motif prediction to encode domain knowledge. S2GRL [79] takes the k-hop contextual prediction as a pre-training task and trains a well-designed GNN to learn representations. SELAR [46] predicts the meta-paths, which are composite relations of multiple edge types in heterogeneous graphs.

Pseudo-Label Prediction

This line of work employs an iterative framework to learn representations and update cluster labels. M3S [98] employs the DeepCluster [13] algorithm to generate pseudo cluster labels and designs an aligning mechanism and a multistage training framework to predict refined pseudo labels. IFC-GCN [43] proposes an EM-like framework that alternately rectifies the pseudo labels from feature clustering and updates node features by predicting the pseudo labels.

Invariant Information Preserving

These works aim to preserve some intrinsic and invariant information in the graph. Compared with the property prediction-based methods, there is usually no explicit meaning or closed-form formula for the invariant information. CCA-SSG [145] employs the canonical correlation analysis (CCA) method to maximize the correlation between two augmented views of the same input and thus preserves augmentation-invariant information. Lagraph [118] assumes that behind each observed graph there exists a noise-free latent graph from which the observed graph is randomly generated. Lagraph implicitly predicts the latent graph as a pre-training task to learn informative graph and node representations.

Contrastive Methods

Contrastive methods first generate two contrastive views from graphs/nodes and then maximize the mutual information between them. In this way, the learned representations become more robust to perturbations. Typically, contrastive methods regard the views from the same graph/node as a positive pair and the views randomly selected from different graphs/nodes as negative pairs. The representation similarity of a positive pair is forced to be larger than that of negative pairs. Categorized by the views to contrast, contrastive methods can be divided into two groups.
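The contrastive objective can be made concrete with an InfoNCE-style loss over two views: matched rows of the two representation matrices form positive pairs, and all other rows act as negatives. The following Python sketch is a generic formulation under an assumed temperature hyperparameter, not the exact loss of any single cited method.

import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE-style contrastive loss: row i of z1 should match row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize both views
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (N, N) similarity matrix
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # positives lie on the diagonal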

Substructure-Based Methods

This line of work usually contrasts views between structures of different scales. InfoGraph [95] takes different substructures of the original graph (e.g., nodes, edges, triangles) as contrastive views to generate graph-level representations. DGI [107] and GMI [80] build contrastive views between a graph and its nodes. MVGRL [39] generates views by sampling subgraphs and learns both node- and graph-level representations. GCC [82] treats different subgraphs as contrastive views and introduces the InfoNCE loss [73] into large-scale graph pre-training.

Augmentation-Based Methods

These works usually generate views by applying different perturbations to input graphs and features. GRACE [158] and GraphCL [142] randomly perturb the input graph (e.g., by node and edge dropping) to generate contrastive views. GCA [159] adaptively builds contrastive views based on different graph properties (e.g., node degree, eigenvector centrality, PageRank). Here nodes or edges with small importance scores (e.g., low node degrees) are more likely to be dropped in contrastive views. JOAO [141] and AD-GCL [100] automatically select graph augmentations and generate contrastive views in adversarial ways. Instead of contrasting augmented data, DSGC [132] constructs contrastive views from the dual spaces of hyperbolic and Euclidean geometry. GASSL [134] generates views by directly perturbing input features and hidden layers. SimGRACE [117] adds Gaussian noise to model parameters as perturbations to generate contrastive views. MA-GCL [34] proposes a novel augmentation strategy that randomly perturbs the neural architecture of GNN encoders (i.e., random permutations of graph filters, linear projections, and nonlinear functions) as contrastive views. HeCo [112] employs the network schema and meta-path views of a heterogeneous graph as two specific views.

Adaptation Approaches

After the self-supervised training process, there are roughly three paradigms to adapt the learned models or representations for downstream tasks.

Pre-training-Fine-Tuning

These methods [44, 87, 142] first train the parameters of the graph encoder on unlabeled datasets in a self-supervised way, and the pre-trained parameters are then used as the initialization for the subsequent fine-tuning step, which updates the encoder in a supervised way on downstream tasks.

Unsupervised Representation Learning

These methods [51, 75, 79, 158] train the graph encoder with pre-training tasks in the first stage, and the pre-trained encoder is taken as a feature extractor with frozen parameters to generate representations for downstream tasks in the second stage.
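This protocol can be summarized in a few lines: the frozen encoder produces representations once, and only a lightweight classifier is trained on them. The sketch below assumes precomputed node representations Z, labels y, and index splits, and uses scikit-learn's logistic regression purely as an example of such a lightweight classifier.

from sklearn.linear_model import LogisticRegression

def evaluate_frozen_representations(Z, y, train_idx, test_idx):
    """Fit a simple classifier on frozen self-supervised representations and report accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(Z[train_idx], y[train_idx])         # only the classifier is trained
    return clf.score(Z[test_idx], y[test_idx])  # the graph encoder itself stays frozen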

Multitask Training

These methods [46, 82, 98] train graph encoders on both pre-training and downstream tasks with well-designed loss functions, which can be seen as a type of multitask learning, where the pre-training tasks are auxiliary tasks for downstream ones.

In summary, self-supervised graph representation learning methods can learn effective graph and node representations based on various pre-training tasks without labels. Different from pre-trained language models, where pre-training-fine-tuning is the mainstream paradigm for adapting to downstream tasks, the unsupervised representation learning paradigm is widely used for graph data, where only the node/graph representations from the last feed-forward layer are fed into downstream tasks as features. An intuitive reason is that popular tasks on graph data (e.g., node/graph classification) require less model capacity than those in NLP (e.g., machine translation). Therefore, the final representations of graph encoders are usually sufficient, and it is not necessary to fine-tune the learned graph encoders.

6.7 Applications

In this section, we will introduce several typical applications of graph representation learning in the NLP area.

Text Classification

Text classification is an essential task in NLP. Typical GNN models, including GCNs [2, 23, 37, 41, 52, 69] and GATs [106], are applied to text classification to model the structural information (e.g., citation relationships) between documents. However, these works do not fully model the structural information within texts. Thus, some works manage to construct graphs from texts. Peng et al. [77] propose to transform texts into graphs of words and then apply graph convolutional operations to them. Yao et al. [135] propose to construct a text graph with document and word nodes and utilize GCNs to learn representations for both word nodes and document nodes. Besides constructing the text graph with human heuristics as in the above works, there are also some intrinsic graph structures that can be used, such as dependency parsing trees and semantic parsing graphs. One of the most typical works utilizing these intrinsic graph structures is the Tree-LSTM [101] introduced in this chapter.

Sequence Labeling

Sequence labeling is another classical task in NLP, which aims to assign a label to each word in the text sequence. Sentence LSTM [151] introduces graph recurrent networks for sentence modeling, where each word node connects with its neighboring words in a context window and a global sentence node links with all word nodes. Then, the hidden representations of word nodes can be used for predicting word labels. Sentence LSTM achieves promising performance on the POS tagging and NER tasks. To solve the semantic role labeling task, Marcheggiani et al. [65] propose a variant of GCN [52] that performs reasoning on syntactic dependency trees (i.e., graphs with labeled edges) and employs an edge-wise gating mechanism to consider the information of each dependency edge. They further show that GCNs and LSTMs are functionally complementary in this task.

Knowledge Acquisition

As a crucial subtask of knowledge acquisition, relation extraction aims to predict the relationship between entities in plain text. While CNNs and RNNs have achieved promising results on multiple benchmarks, researchers find that the syntactic and semantic information of the text, such as adjacency, syntactic dependencies, and discourse relations, is also helpful for relation extraction. To this end, Zhang et al. [152] propose a GNN-based method for relation extraction, which can efficiently aggregate information from arbitrary dependency graphs. It also applies a novel pruning strategy to the input tree that keeps only the words informative for relation extraction. To model the rich relations across entities, Zhu et al. [157] propose to construct entity graphs and generate edge parameters to propagate diverse relational information, which greatly extends the edge types and enables GNNs to conduct complex reasoning. Considering cross-sentence dependencies like coreference and discourse relations, Peng et al. [78] further explore extracting N-ary relations among multiple entities from multiple sentences by applying graph LSTMs to document graphs.

Event extraction is another knowledge acquisition task, which aims to identify event triggers and the corresponding arguments of each trigger in texts. Nguyen et al. [72] propose a syntactic GCN, which models the dependency tree and learns representations of word nodes to extract events from texts. In addition, Liu et al. [59] find that modeling syntactic relations helps capture long-range dependencies better, and these shortcut arcs can help extract multiple events jointly. Meanwhile, to deal with ambiguous and unseen triggers, Zhang et al. [149] summarize event structure knowledge from training data and construct event background graphs for each event. These graphs help identify correct events by matching against the structure knowledge.

Fact Verification

Fact verification aims to retrieve evidence from plain text and verify given claims with the evidence. In other words, we need to label a given claim as SUPPORTED, REFUTED, or NOT ENOUGH INFO, indicating that the evidence supports the claim, refutes it, or is insufficient to validate it. By regarding the problem as a natural language inference (NLI) [1] task, traditional methods simply combine the evidence via concatenation or build models on evidence-claim pairs. To integrate multiple pieces of evidence for fact reasoning, Zhou et al. [156] propose a graph-based evidence aggregating and reasoning framework, which propagates and aggregates information on a fully connected evidence graph. In this way, different pieces of evidence can interact with each other sufficiently, which helps in situations where multiple pieces of evidence are necessary for making the decision. Liu et al. [61] further incorporate a kernel-based attention mechanism into GAT for fine-grained evidence aggregation, and the proposed KGAT achieves improved performance.

Machine Translation

Traditionally, machine translation (MT) is modeled as a sequence-to-sequence (seq2seq) problem. Graph structures, however, enable MT models to incorporate explicit linguistic biases. To this end, two typical kinds of graphs can be utilized in MT. Some works [4, 5, 36] model syntactic trees with GCNs to learn syntax-aware sentence representations and show improvements over vanilla seq2seq methods. To involve more semantic information, other works further consider semantic role labeling (SRL) and abstract meaning representation (AMR) graphs [64, 94]. Besides traditional MT, GNNs are also utilized in multimodal and document-level MT. For multimodal MT, Yin et al. [137] build combined graphs with entities from both images and sentences and then apply GNNs for message passing across modalities. For document-level MT, Xu et al. [121] model long-term dependencies (e.g., coreference) with GNNs on document graphs and achieve superior performance in translation coherence.

Question Answering

Question answering (QA) aims to generate or find answers for a question based on relevant documents or knowledge bases, which requires models to reason and infer the right answers given the question (see Chap. 4 for an introduction). Since GNNs are designed to model relational data, they are also suitable for QA. To apply GNNs to QA, we need to first build graphs containing question-related entities and their relations. Typically, according to the provided context, researchers could utilize existing knowledge graphs [96, 97, 136] or extract entities and links from documents to construct graphs [11, 24, 29, 30, 58]. Afterward, multiple kinds of GNNs such as GCN, GAT, and R-GCN [90] can be directly used to reason over the graphs [24, 58]. To jointly capture relational and semantic messages, some works explore how to combine powerful PTMs and GNNs for complex reasoning. Ding et al. [24] construct entity graphs utilizing BERT, which predicts node entities and relational edges iteratively. Yasunaga et al. [136] further incorporate PTM-encoded context representations into knowledge graphs to better perform message passing.

Besides the applications in the NLP area, GNNs are widely used in various application scenarios, such as community detection [104, 133, 148], information diffusion prediction [130], recommender systems [8, 28, 115, 129, 139], molecular fingerprints [50], chemical reaction prediction [25], protein interface prediction [31], biomedical engineering [86, 161], etc. Since our book mainly focuses on NLP, readers who are interested in these applications can refer to these papers for more details.

6.8 Summary and Further Readings

In this chapter, we have introduced graph representation learning, which projects graph structure information into a continuous vector space and makes deep learning techniques applicable to graph data. We first talk about shallow node representations, covering spectral clustering, shallow neural networks, and matrix factorization. Then we introduce deep node representations, from autoencoder-based methods to graph neural networks. Afterward, we present how to obtain the global graph representation from node representations. Finally, we introduce how graph representation learning helps a series of NLP tasks.

For further reading on graph representation learning, there are several recommended surveys and books. In terms of general graph embedding, Goyal and Ferrara [35] and Cui et al. [22] conduct surveys on relevant models and their applications. We also write a monograph [126] about our systematic work on the topic of network embedding. In terms of GNNs, Wu et al. [114] write a book covering more than 20 topics of GNNs in 4 parts: introduction, foundations, frontiers, and applications. Shi et al. [93] write a monograph providing a launch point for discussing the latest trends in GNNs. Wu et al. [113] present a survey of NLP applications based on GNNs. Shi et al. [92] focus on the representation learning of heterogeneous graphs. We also provide a more comprehensive survey of GNNs in our review [155], covering a broader range of aspects, such as applications to images and chemistry.