Deep Graph Similarity Learning: A Survey

In many domains where data are represented as graphs, learning a similarity metric among graphs is considered a key problem, which can further facilitate various learning tasks, such as classification, clustering, and similarity search. Recently, there has been an increasing interest in deep graph similarity learning, where the key idea is to learn a deep learning model that maps input graphs to a target space such that the distance in the target space approximates the structural distance in the input space. Here, we provide a comprehensive review of the existing literature of deep graph similarity learning. We propose a systematic taxonomy for the methods and applications. Finally, we discuss the challenges and future directions for this problem.

Learning similarity measures automatically from data is the primary aim of similarity learning. Similarity (or metric) learning refers to learning a function that measures the distance or similarity between objects, which is a critical step in many machine learning problems, such as classification, clustering, and ranking. For example, in k-Nearest Neighbor (kNN) classification [25], a metric is needed for measuring the distance between data points and identifying the nearest neighbors; in many clustering algorithms, similarity measurements between data points are used to determine the clusters. Although general metrics such as the Euclidean distance can be used to measure similarity between objects represented as vectors, these metrics often fail to capture the specific characteristics of the data being studied, especially for structured data. Therefore, it is essential to find or learn a metric for measuring the similarity of the data points involved in the specific task.
Metric learning has been widely studied in many fields on various data types. For instance, in computer vision, metric learning has been explored on images and videos for image classification, object recognition, visual tracking, and other learning tasks [72,42,50]. In information retrieval, for example in search engines, metric learning has been used to rank the documents relevant to a given query [58,61]. In this paper, we survey the existing work in similarity learning for graphs, which encode relational structures and are ubiquitous in various domains.
Similarity learning for graphs has been studied for many real applications, such as molecular graph classification in chemoinformatics [45,32], protein-protein interaction network analysis for disease prediction [18], binary function similarity search in computer security [60], and multi-subject brain network similarity learning for neurological disorder analysis [55]. In many of these application scenarios, the number of training samples available is often very limited, making it difficult to directly train a classification or prediction model. With graph similarity learning strategies, these applications benefit from pairwise learning, which utilizes every pair of training samples to learn a metric for mapping the input data to the target space, which further facilitates the specific learning task.
In the past few decades, many techniques have emerged for studying the similarity of graphs. Early on, multiple graph similarity metrics were defined, such as the Graph Edit Distance [21], Maximum Common Subgraph [22,98], and Graph Isomorphism [29,16], to address the problems of graph similarity search and graph matching. However, the computation of these metrics is an NP-complete problem in general [113]. Although some pruning strategies and heuristic methods have been proposed to approximate the values and speed up the computation, it is difficult to analyze the computational complexity of these heuristic algorithms, and the error of the sub-optimal solutions they provide is unbounded [113]. Therefore, these approaches are feasible only for graphs of relatively small size and in practical applications where these metrics are of primary interest, and it is hard to adapt them to new tasks. More recently, researchers started to formulate similarity estimation as a learning problem where the goal is to learn a model that maps a pair of graphs to a similarity score based on the graph representations. For example, graph kernels such as path-based kernels [17] and the subgraph matching kernel [107,110] were proposed for graph similarity learning. Traditional graph embedding techniques, such as geometric embedding, are also leveraged for graph similarity learning [51]. Although these methods have shown some effectiveness, they generally do not learn well across different kinds of graphs, especially in domains where the graphs have more complicated structures that are difficult to capture with these traditional graph representation learning methods. With the emergence of deep learning techniques, graph neural networks have become a powerful new tool for learning representations on graphs with various structures. Deep graph similarity learning has also emerged as a new strategy for graph similarity learning problems in different domains.
For instance, some GNN-based graph similarity predictive models have been introduced for chemical compound queries in computational chemistry [12] and brain connectivity network analysis in neuroscience [55,65]. Some deep graph matching networks have also been proposed for binary function similarity search and malware detection in computer security [60,101].
In this survey paper, we provide a systematic review of the existing work in deep graph similarity learning. Based on the different graph representation learning strategies and how they are leveraged for the deep graph similarity learning task, we propose to categorize deep graph similarity learning models into three groups: graph embedding-based methods, GNN-based methods, and deep graph kernel-based methods. Additionally, we sub-categorize the models based on their properties. Table 2 shows our proposed taxonomy, with some example models for each category as well as the relevant applications. In this survey, we will illustrate how these different categories of models approach the graph similarity learning problem. We will also discuss the loss functions used for the graph similarity learning task.
Scope and Contributions. This paper is focused on surveying the recently emerged deep models for graph similarity learning, where the goal is to use deep strategies on graphs for learning the similarity of given pairs of graphs, instead of computing similarity scores based on predefined measures. We emphasize that this paper does not attempt to survey the extensive literature on graph representation learning, graph neural networks, and graph embedding. Prior work has focused on these topics (see [23,39,57,105,82,26,114] for examples). Here instead, we focus on deep graph representation learning methods that explicitly model graph similarity. To the best of our knowledge, this is the first survey paper on this problem. We summarize the main contributions of this paper as follows:
• A comprehensive taxonomy to categorize the literature of the emerging field of deep graph similarity learning.
• Summary and discussion of the key functions and building blocks of the relevant models in the literature.
• Summary and comparison of the different deep graph similarity learning models across the taxonomy.
• Summary and discussion of the real-world applications that can benefit from deep graph similarity learning in a variety of domains.
• Summary and discussion of the major challenges for deep graph similarity learning, the future directions, and the open problems.
Table 1 (excerpt). Notation: similarity function; $s_{ij}$, similarity score between two graphs; convolution of $g_\theta$ and $x$.
Organization. The rest of the paper is organized as follows. In Section 2, we introduce notation, preliminary concepts, and define the graph similarity learning problem. In Section 3, we introduce the taxonomy with detailed illustrations of the existing deep models. In Section 4, we present the applications of deep graph similarity learning in various domains. In Section 5, we discuss the remaining challenges in this area and highlight future directions. Finally, we conclude in Section 6.

Notation and Preliminaries
In this section, we provide the necessary notation and definitions of the fundamental concepts pertaining to the graph similarity problem that will be used throughout this survey. The notation is summarized in Table 1. Let $G = (V, E, A)$ denote a graph, where $V$ is the set of nodes, $E \subseteq V \times V$ is the set of edges, and $A \in \mathbb{R}^{|V| \times |V|}$ is the adjacency matrix of the graph. This is a general notation for graphs that covers different types of graphs, including unweighted/weighted graphs, undirected/directed graphs, and attributed/non-attributed graphs.
We also assume a set of graphs as input, $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$, and the goal is to measure or model their pairwise similarity. This relates to the classical problem of graph isomorphism and its variants. In graph isomorphism [74], two graphs $G = (V_G, E_G)$ and $H = (V_H, E_H)$ are isomorphic (i.e., $G \cong H$) if there is a bijective mapping $\pi : V_G \to V_H$ such that $(u, v) \in E_G$ if and only if $(\pi(u), \pi(v)) \in E_H$. The graph isomorphism problem is in NP, and no efficient algorithms are known for it. Subgraph isomorphism is a generalization of the graph isomorphism problem. In subgraph isomorphism, the goal is to decide, for two input graphs $G$ and $H$, whether there is a subgraph $G'$ of $G$ ($G' \subset G$) such that $G'$ is isomorphic to $H$ (i.e., $G' \cong H$). This is suitable in a setting in which the two graphs have different sizes. The subgraph isomorphism problem has been proven to be NP-complete (unlike the graph isomorphism problem) [37]. The maximum common subgraph problem is another, less restrictive measure of graph similarity, in which the similarity between two graphs is defined based on the size of the largest common subgraph of the two input graphs. However, this problem is also NP-complete [37].
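To make the isomorphism condition concrete, the following is a minimal brute-force sketch, not an algorithm from the surveyed literature. It assumes undirected graphs without self-loops on nodes `0..n-1`, given as edge lists; the function name `are_isomorphic` is hypothetical. It tries every node bijection, so its cost grows factorially with the number of nodes.

```python
from itertools import permutations


def are_isomorphic(n, edges_g, edges_h):
    """Return True if some bijection on nodes 0..n-1 maps the edges of G onto those of H."""
    target = {frozenset(e) for e in edges_h}
    if len({frozenset(e) for e in edges_g}) != len(target):
        return False  # different edge counts can never match
    for perm in permutations(range(n)):
        # Apply the candidate bijection to every edge of G.
        mapped = {frozenset((perm[u], perm[v])) for u, v in edges_g}
        if mapped == target:
            return True
    return False


path = [(0, 1), (1, 2), (2, 3)]
relabeled_path = [(2, 0), (0, 3), (3, 1)]
star = [(0, 1), (0, 2), (0, 3)]
print(are_isomorphic(4, path, relabeled_path))
print(are_isomorphic(4, path, star))
```

The output of such a test is binary, which is the restriction that motivates the relaxed, learned notion of similarity discussed in this survey.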
Clearly, graph isomorphism and its related variants (e.g., subgraph isomorphism and maximum common subgraph) focus on measuring the topological equivalence of graphs, which gives rise to a binary similarity measure that outputs 1 if two graphs are isomorphic and 0 otherwise. While these measures are intuitive, they are very restrictive and difficult to compute for large graphs. Here, we instead focus on a relaxed notion of graph similarity that can be measured using machine learning models, where the goal is to learn a model that quantifies the degree of structural similarity and relatedness between two graphs. This is loosely related to work on modeling the structural similarity between nodes in the same graph [80,7]. We formally state the definition of graph similarity learning (GSL) in Definition 1.
Definition 1 (Graph Similarity Learning) Let $\mathcal{G}$ be an input set of graphs, and let $\mathcal{M}$ denote a similarity function with real-valued parameters $W$, such that $\mathcal{M} : (G_i, G_j) \to \mathbb{R}$ for any pair of graphs $G_i, G_j \in \mathcal{G}$. Let $s_{ij} \in \mathbb{R}$ denote the similarity score computed by $\mathcal{M}$ for the pair $G_i$ and $G_j$. Then $\mathcal{M}$ is symmetric if and only if $s_{ij} = s_{ji}$ for any pair of graphs $G_i, G_j \in \mathcal{G}$, and $\mathcal{M}$ is optimal if $s_{ii} > s_{ij}$ for any pair of graphs $G_i, G_j \in \mathcal{G}$ with $i \neq j$.

Taxonomy of Models
In this section, we describe a taxonomy for the literature of deep graph similarity learning. The characteristics and applications of all the methods are summarized in Table 2. We organize the existing methods into three main categories and describe them next:
1. Graph embedding based GSL
2. Graph Neural Network based GSL
3. Deep graph kernel based GSL

Graph Embedding based Graph Similarity Learning
Graph embedding has received considerable attention in the past decade [26,114], and a variety of deep graph embedding models have been proposed in recent years [47,75,35]. Similarity learning methods based on graph embedding seek to utilize node-level or graph-level representations learned by graph embedding techniques [95,92,75]. Given a collection of graphs, these works first aim to map each graph $G$ into a $d$-dimensional space ($d \ll |V|$), where the graph is represented either as a set of $d$-dimensional vectors, with each vector representing the embedding of one node (i.e., node-level embedding), or as a single $d$-dimensional vector for the whole graph (i.e., graph-level embedding) [23]. The graph embeddings are usually learned in an unsupervised manner in a separate stage prior to the similarity learning stage, in which the embeddings obtained are used for estimating or predicting the similarity score between each pair of graphs.

Node-level Embedding based Methods
Node-level embedding based methods compare graphs using the node-level representations learned from the graphs. The similarity scores obtained by these methods mainly capture the similarity between the corresponding nodes in two graphs. Therefore they focus on the local node-level information on graphs during the learning process.
node2vec-PCA. In [92], the node2vec approach [41] is employed for obtaining the node-level embeddings of graphs. To make the embeddings of all the graphs in the given collection comparable, principal component analysis (PCA) is applied to the embeddings to retain the first $d \ll D$ principal components (where $D$ is the dimensionality of the original node embedding space). Afterwards, 2D slices are extracted from the node embedding space, where each 2D slice is used to compute a 2D histogram based on a grid structure imposed on the 2D space. Then, the graph is represented as a stack of 2D histograms of its node embeddings. The graphs are then compared in the grid space and input into a 2D CNN as multi-channel image-like structures for a graph classification task.
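The histogram step can be sketched as follows. This is an illustrative simplification, not the implementation from [92]: it assumes that each 2D slice pairs consecutive embedding dimensions, that the grid covers a fixed range `[lo, hi)` in both directions, and that out-of-range values are clamped to the border bins. The function names are hypothetical.

```python
def histogram_2d(points, bins, lo, hi):
    """Count 2D points on a bins x bins grid laid over [lo, hi) x [lo, hi)."""
    grid = [[0] * bins for _ in range(bins)]
    width = (hi - lo) / bins
    for x, y in points:
        # Clamp indices so that out-of-range values fall into the border bins.
        col = min(max(int((x - lo) / width), 0), bins - 1)
        row = min(max(int((y - lo) / width), 0), bins - 1)
        grid[row][col] += 1
    return grid


def graph_to_channels(embeddings, bins=4, lo=-1.0, hi=1.0):
    """Build one histogram channel per pair of consecutive embedding dimensions."""
    d = len(embeddings[0])
    channels = []
    for k in range(0, d - 1, 2):
        slice_2d = [(e[k], e[k + 1]) for e in embeddings]
        channels.append(histogram_2d(slice_2d, bins, lo, hi))
    return channels


# Three nodes with 4-dimensional embeddings yield a 2-channel "image".
node_embeddings = [[-0.9, -0.9, 0.1, 0.1], [0.9, 0.9, 0.1, 0.1], [0.0, 0.0, -0.5, 0.5]]
image = graph_to_channels(node_embeddings, bins=2)
print(image)
```

Every graph is mapped to a stack of equally sized grids regardless of its number of nodes, which is what allows a standard 2D CNN to consume it.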
Bag-of-Vectors. In [78], the nodes of the graphs are first embedded in the Euclidean space using the eigenvectors of the adjacency matrices of the graphs, and each graph is then represented as a bag-of-vectors. The similarity between two graphs is then measured by computing a matching based on the Earth Mover's Distance [83] between the two sets of embeddings.
Although node embedding based graph similarity learning methods have been extensively developed, a common problem with these methods is that, since the comparison is based on node-level representations, the global structure of the graphs tends to be ignored, which actually is very important for comparing two graphs in terms of their structural patterns.

Graph-level Embedding based Methods
The graph-level embedding based methods aim to learn a vector representation for each graph and then learn the similarity score between graphs based on their vector representations.
graph2vec. In [75], graph2vec was proposed to learn distributed representations of graphs, similar to Doc2vec [56] in natural language processing. In graph2vec, each graph is viewed as a document and the rooted subgraphs around every node in the graph are viewed as the words that compose the document. The Weisfeiler-Lehman relabeling process is used to extract the rooted subgraphs, and skip-gram with negative sampling is applied for updating the embeddings. After the graph embedding is obtained for each graph, the similarity or distance between graphs is computed in the embedding space for downstream prediction tasks (e.g., graph classification and clustering).
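The Weisfeiler-Lehman relabeling that produces the "words" of each graph "document" can be sketched as below. This is a simplified illustration, not the graph2vec code: it keeps the full label strings instead of compressing them to short identifiers, it omits the skip-gram training, and the function names are hypothetical.

```python
def wl_labels(adj, labels, depth):
    """Return, for every node, its WL labels for subgraph heights 0..depth."""
    current = dict(labels)
    history = {v: [current[v]] for v in adj}
    for _ in range(depth):
        updated = {}
        for v in adj:
            # A node's new label combines its own label with the sorted neighbor labels.
            neigh = sorted(current[u] for u in adj[v])
            updated[v] = current[v] + "(" + ",".join(neigh) + ")"
        current = updated
        for v in adj:
            history[v].append(current[v])
    return history


def graph_document(adj, labels, depth):
    """Collect the rooted-subgraph labels of all nodes into one sorted bag of words."""
    history = wl_labels(adj, labels, depth)
    return sorted(word for v in adj for word in history[v])


path = {0: [1], 1: [0, 2], 2: [1]}
doc = graph_document(path, {0: "C", 1: "C", 2: "C"}, depth=1)
print(doc)
```

Because neighbor labels are sorted before concatenation, the resulting bag of words does not depend on how the nodes are numbered.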
Neural Networks with Structure2vec. In [106], a deep graph embedding approach is proposed for cross-platform binary code similarity detection. A Siamese architecture is applied to enable pairwise similarity learning, and a graph embedding network based on Structure2vec [27] is used for learning graph representations in the twin networks, which share weights with each other. The training data is a set of $K$ pairs of graphs $\langle G_i, G_i' \rangle$ with ground-truth pair labels $y_i \in \{+1, -1\}$, where $y_i = +1$ indicates that $G_i$ and $G_i'$ are similar, and $y_i = -1$ indicates that they are dissimilar. With the Structure2vec embedding outputs for $G_i$ and $G_i'$ denoted by $f_i$ and $f_i'$ respectively, the Siamese network output for each pair is defined as
$$\mathrm{Sim}(G_i, G_i') = \cos(f_i, f_i') = \frac{\langle f_i, f_i' \rangle}{\|f_i\| \cdot \|f_i'\|}, \tag{1}$$
and the following loss function is used for training the model:
$$L = \sum_{i=1}^{K} \left( \mathrm{Sim}(G_i, G_i') - y_i \right)^2. \tag{2}$$
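A minimal sketch of such a pairwise objective is given below. It assumes that the Siamese output is the cosine similarity of the two embeddings and that the loss is the squared error against the $\pm 1$ pair labels; the embeddings are plain lists standing in for the Structure2vec outputs, and the function names are hypothetical.

```python
import math


def cosine_similarity(f, f_prime):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(f, f_prime))
    norm = math.sqrt(sum(a * a for a in f)) * math.sqrt(sum(b * b for b in f_prime))
    return dot / norm


def siamese_squared_loss(pairs, labels):
    """Sum of squared differences between pair similarities and labels in {+1, -1}."""
    return sum((cosine_similarity(f, fp) - y) ** 2 for (f, fp), y in zip(pairs, labels))


pairs = [([1.0, 0.0], [1.0, 0.0]),   # identical embeddings, labeled similar
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal embeddings, labeled dissimilar
print(siamese_squared_loss(pairs, [+1, -1]))
```

Under these assumptions the first pair contributes no loss, while the second is penalized until the two embeddings point in opposite directions.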
Simple Permutation-Invariant GCN. In [10], a graph representation learning method based on simple permutation-invariant graph convolutional network is proposed for the graph similarity and graph classification problem. A graph convolution module is used to encode local graph structure and node features, after which a sum-pooling layer is used to transform the substructure feature matrix computed by the graph convolutions into a single feature vector representation of the input graphs. The vector representation is then used as features for each graph, based on which the graph similarity or graph classification task can be performed.
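The permutation invariance obtained from sum-pooling can be checked with a small sketch. The convolution used here is a deliberately simple, parameter-free stand-in (each node adds its neighbors' features to its own) rather than the layer from [10]; it assumes a 0/1 adjacency matrix without self-loops, and all function names are hypothetical.

```python
def graph_conv(adj, feats):
    """One parameter-free convolution: a node's features plus those of its neighbors."""
    n = len(feats)
    out = []
    for i in range(n):
        row = list(feats[i])
        for j in range(n):
            if adj[i][j]:
                row = [r + f for r, f in zip(row, feats[j])]
        out.append(row)
    return out


def sum_pool(node_feats):
    """Collapse the node feature matrix into one graph-level vector."""
    return [sum(col) for col in zip(*node_feats)]


def permute(adj, feats, perm):
    """Renumber the nodes of a graph according to perm."""
    n = len(perm)
    new_adj = [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    new_feats = [feats[perm[i]] for i in range(n)]
    return new_adj, new_feats


adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1, 0], [0, 2], [3, 1]]
original = sum_pool(graph_conv(adj, feats))
shuffled = sum_pool(graph_conv(*permute(adj, feats, [2, 0, 1])))
print(original, shuffled)
```

Both calls produce the same vector, because renumbering the nodes only reorders the rows that the pooling layer sums over.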
SEED: Sampling, Encoding, and Embedding Distributions. In [99], an inductive and unsupervised graph representation learning approach called SEED is proposed for graph similarity learning. The proposed framework consists of three components: sampling, encoding, and embedding distribution. In the sampling stage, a number of subgraphs called WEAVEs are sampled based on random walks with earliest visit time. Then, in the encoding stage, an autoencoder [44] is used to encode the subgraphs into dense low-dimensional vectors. Given a set of $k$ sampled WEAVEs $\{X_1, X_2, X_3, \cdots, X_k\}$, for each subgraph $X_i$ the autoencoder works as follows:
$$z_i = f(X_i; \theta_e), \qquad \hat{X}_i = g(z_i; \theta_d), \tag{3}$$
where $z_i$ is the dense low-dimensional representation of the input WEAVE subgraph $X_i$, $f(\cdot)$ is the encoding function implemented with a multi-layer perceptron (MLP) with parameters $\theta_e$, and $g(\cdot)$ is the decoding function implemented by another MLP with parameters $\theta_d$. A reconstruction loss is used to train the autoencoder:
$$L = \sum_{i=1}^{k} \|X_i - \hat{X}_i\|_2^2. \tag{4}$$
After the autoencoder is trained, the final subgraph embedding vectors $z_1, z_2, z_3, \cdots, z_k$ can be obtained for each graph. Finally, in the embedding distribution stage, the distance between the subgraph distributions of two input graphs $G$ and $H$ is evaluated using the maximum mean discrepancy (MMD) [40] on the embeddings. Assuming that the $k$ subgraphs sampled from $G$ are encoded into embeddings $z_1, z_2, \cdots, z_k$ and the $k$ subgraphs of $H$ are encoded into embeddings $h_1, h_2, \cdots, h_k$, the MMD distance between $G$ and $H$ is
$$\mathrm{MMD}(G, H) = \|\hat{\mu}_G - \hat{\mu}_H\|_2^2, \tag{5}$$
where $\hat{\mu}_G$ and $\hat{\mu}_H$ are the empirical kernel embeddings of the two distributions, defined as
$$\hat{\mu}_G = \frac{1}{k} \sum_{i=1}^{k} \phi(z_i), \qquad \hat{\mu}_H = \frac{1}{k} \sum_{i=1}^{k} \phi(h_i), \tag{6}$$
where $\phi(\cdot)$ is the feature mapping function of the kernel used for graph similarity evaluation. An identity kernel is applied in this work.
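For the identity kernel, the MMD computation reduces to comparing the mean subgraph embeddings of the two graphs. The sketch below assumes the feature map $\phi(z) = z$ and a squared Euclidean norm of the difference between the empirical means; the embeddings are hand-written lists standing in for autoencoder outputs, and the function names are hypothetical.

```python
def mean_embedding(vectors):
    """Empirical mean of a list of equally sized embedding vectors."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]


def mmd_identity(z, h):
    """Squared distance between the mean embeddings of two graphs (identity feature map)."""
    mu_g, mu_h = mean_embedding(z), mean_embedding(h)
    return sum((a - b) ** 2 for a, b in zip(mu_g, mu_h))


z = [[0.0, 0.0], [2.0, 2.0]]   # subgraph embeddings of graph G
h = [[1.0, 1.0], [1.0, 1.0]]   # subgraph embeddings of graph H
print(mmd_identity(z, h))
```

The example shows a limitation of this choice: the two sets of embeddings differ, yet their means coincide, so the distance is zero. A richer feature map would be needed to separate them.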
DGCNN: Disordered Graph CNN. In [104], another graph-level representation learning approach called DGCNN is introduced based on a graph CNN and a Gaussian mixture model, where a set of key nodes is selected from each graph. Specifically, to ensure that the number of neighborhoods is consistent across graphs, the same number of key nodes is sampled for each graph in a key node selection stage. Then a convolution operation is performed over the kernel parameter matrix and the nodes in the neighborhood of the selected key nodes, after which the output of the convolutional layer is used as the input of the fully connected layer. Finally, the output of the dense hidden layer is used as the feature vector for each graph in the graph similarity retrieval task.
N-Gram Graph Embedding. In [64], an unsupervised graph representation method called N-gram is proposed for similarity learning on molecule graphs. It first views each node in the graph as one token, applies an analog of the CBOW (continuous bag of words) [73] strategy, and trains a neural network to learn the node embeddings for each graph. Then it enumerates the walks of length $n$ in each graph, where each walk is called an n-gram, and obtains the embedding for each n-gram by assembling the embeddings of the nodes in the n-gram. The final graph-level representation is constructed based on the embeddings of all the n-grams in the graph. Finally, the graph-level embeddings are used for the similarity prediction or graph classification task for molecule analysis.
In summary, the main advantage of the embedding based methods is that they provide a variety of perspectives and strategies for learning representations from graphs and demonstrate that these representations can be used for graph similarity learning. However, these solutions also have shortcomings. A common one is that the embeddings are learned independently on the individual graphs in a stage separate from the similarity learning. The graph-graph proximity is therefore not considered or utilized in the graph representation learning process, and the representations learned by these models may be less suitable for graph-graph similarity prediction than those of methods that integrate similarity learning with graph representation learning in an end-to-end framework.

GNN-based Graph Similarity Learning
GNNs. Graph neural networks (GNNs) were first formulated in [38], which proposed to use a propagation process to learn node representations for graphs. This formulation was further extended by [84,33]. Later, graph convolutional networks were proposed, which compute node updates by aggregating information in local neighborhoods [20,28,53]. They have become the most popular graph neural networks and are widely used and extended for graph representation learning in various domains [117,115,36,34,35].
With the development of graph neural networks, researchers began to build graph similarity learning models based on GNNs. In this section, we will first introduce the workflow of GCNs with the spectral GCN [87] as an example, and then describe the GNN-based graph similarity learning methods covering three main categories. Given a graph $G = (V, E, A)$ with $|V| = m$ nodes and adjacency matrix $A \in \mathbb{R}^{m \times m}$, let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. The graph Laplacian is $L = D - A$, which can be normalized as $L = I_m - D^{-1/2} A D^{-1/2}$, where $I_m$ is the identity matrix. Assume the orthonormal eigenvectors of $L$ are $\{u_l\}_{l=0}^{m-1}$ with $u_l \in \mathbb{R}^m$, and their associated eigenvalues are $\{\lambda_l\}_{l=0}^{m-1}$. The Laplacian is then diagonalized by the Fourier basis $U = [u_0, \cdots, u_{m-1}] \in \mathbb{R}^{m \times m}$, so that $L = U \Lambda U^T$ with $\Lambda = \mathrm{diag}([\lambda_0, \cdots, \lambda_{m-1}]) \in \mathbb{R}^{m \times m}$. The graph Fourier transform of a signal $x \in \mathbb{R}^m$ can then be defined as $\hat{x} = U^T x \in \mathbb{R}^m$ [87]. Suppose a signal vector $x : V \to \mathbb{R}$ is defined on the nodes of graph $G$, where $x_i$ is the value of $x$ at the $i$-th node. Then the signal $x$ can be filtered by $g_\theta$ as
$$y = g_\theta(L) x = g_\theta(U \Lambda U^T) x = U g_\theta(\Lambda) U^T x, \tag{7}$$
where the filter $g_\theta(\Lambda)$ can be defined as $g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$, and the parameter $\theta \in \mathbb{R}^K$ is a vector of polynomial coefficients [28]. GCNs can be constructed by stacking multiple convolutional layers in the form of Equation (7), with a non-linear activation (ReLU) following each layer.
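For a polynomial filter, the spectral filtering operation can be evaluated without an eigendecomposition, since $L^k = U \Lambda^k U^T$ implies $U g_\theta(\Lambda) U^T x = \sum_k \theta_k L^k x$. The sketch below uses this identity with the normalized Laplacian; it assumes an undirected graph without self-loops given as a dense adjacency matrix, and the function names are hypothetical.

```python
import math


def normalized_laplacian(adj):
    """Return I - D^{-1/2} A D^{-1/2} for a dense adjacency matrix without self-loops."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lap[i][i] = 1.0
        for j in range(n):
            if i != j and adj[i][j] and deg[i] > 0 and deg[j] > 0:
                lap[i][j] = -adj[i][j] / math.sqrt(deg[i] * deg[j])
    return lap


def polynomial_filter(lap, x, theta):
    """Return sum_k theta[k] * L^k x by repeated matrix-vector products."""
    n = len(x)
    power = list(x)                       # L^0 x
    y = [theta[0] * v for v in power]
    for coeff in theta[1:]:
        power = [sum(lap[i][j] * power[j] for j in range(n)) for i in range(n)]
        y = [yi + coeff * pi for yi, pi in zip(y, power)]
    return y


lap = normalized_laplacian([[0, 1], [1, 0]])
# [1, -1] is an eigenvector of this Laplacian with eigenvalue 2, so the
# filter g(lambda) = 1 + lambda scales it by g(2) = 3.
print(polynomial_filter(lap, [1.0, -1.0], [1.0, 1.0]))
```

Each additional polynomial order requires only one more multiplication with $L$, so a filter of order $K$ mixes information from nodes at most $K-1$ hops apart.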
The similarity learning methods based on GNNs seek to learn graph representations by GNNs while doing the similarity learning task in an end-to-end fashion. Fig. 1 illustrates a general workflow of GNN-based graph similarity learning models. Given pairs of input graphs $\langle G_i, G_j, y_{ij} \rangle$, where $y_{ij}$ denotes the ground-truth similarity label or score of $\langle G_i, G_j \rangle$, the GNN-based GSL methods first employ multi-layer GNNs with weights $W$ to learn the representations of $G_i$ and $G_j$ in the encoding space, where the learning on each graph in a pair can influence the other through mechanisms such as weight sharing and cross-graph interactions between the GNNs for the two graphs. The GNN layers output a matrix or vector representation for each graph, after which a dot product layer or fully connected layers can be added to produce or predict the similarity score between the two graphs. Finally, the similarity estimates for all pairs of graphs and their ground-truth labels are used in a loss function for training the model $\mathcal{M}$ with parameters $W$.
Based on how graph-graph similarity/proximity is leveraged in the learning, we summarize the existing GNN-based graph similarity learning work into three main categories: 1) GNN-CNN mixed models for graph similarity prediction, 2) Siamese GNNs for graph similarity prediction, and 3) GNN-based graph matching networks.

GNN-CNN Models for Graph Similarity Prediction
The works that use GNN-CNN mixed networks for graph similarity prediction mainly employ GNNs to learn graph representations and feed the learned representations into CNNs for predicting similarity scores, which is approached as a classification or regression problem. Fully connected layers are often added for the similarity score prediction in an end-to-end learning framework.
GSimCNN. In [13], a method called GSimCNN is proposed for pairwise graph similarity prediction, which consists of three stages. In Stage 1, node representations are first generated by multi-layer GCNs, where each layer is defined as
$$\mathrm{conv}(x_i) = \mathrm{ReLU}\Big( \sum_{j \in \mathcal{N}(i)} \frac{1}{\sqrt{d_i d_j}} x_j W^{(l)} + b^{(l)} \Big), \tag{8}$$
where $\mathcal{N}(i)$ is the set of first-order neighbors of node $i$ plus node $i$ itself, $d_i$ is the degree of node $i$ plus 1, $W^{(l)}$ is the weight matrix of the $l$-th GCN layer, $b^{(l)}$ is the bias, and $\mathrm{ReLU}(x) = \max(0, x)$ is the activation function. In Stage 2, the inner products between all possible pairs of node embeddings from the two graphs are calculated for each GCN layer, which results in multiple similarity matrices. Finally, the similarity matrices from different layers are processed by multiple independent CNNs, whose outputs are concatenated and fed into fully connected layers for predicting the final similarity score $s_{ij}$ for each pair of graphs $G_i$ and $G_j$.
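The first two stages can be sketched as follows. This is an illustration under stated assumptions, not the GSimCNN implementation: the layer uses the symmetric $1/\sqrt{d_i d_j}$ normalization with each node included in its own neighborhood, the adjacency matrix has no self-loops, the weights are fixed by hand instead of learned, and the function names are hypothetical.

```python
import math


def gcn_layer(adj, feats, weight, bias):
    """One GCN layer: normalized neighborhood sum, linear map, then ReLU."""
    n = len(adj)
    # d_i is the node degree plus one, because N(i) also contains node i itself.
    deg = [sum(1 for a in adj[i] if a) + 1 for i in range(n)]
    d_in, d_out = len(feats[0]), len(bias)
    out = []
    for i in range(n):
        neigh = [j for j in range(n) if adj[i][j]] + [i]
        agg = [0.0] * d_in
        for j in neigh:
            c = 1.0 / math.sqrt(deg[i] * deg[j])
            agg = [a + c * f for a, f in zip(agg, feats[j])]
        row = [sum(agg[p] * weight[p][q] for p in range(d_in)) + bias[q] for q in range(d_out)]
        out.append([max(0.0, v) for v in row])
    return out


def similarity_matrix(h1, h2):
    """Inner products between every node embedding of graph 1 and of graph 2."""
    return [[sum(a * b for a, b in zip(u, v)) for v in h2] for u in h1]


edge = [[0, 1], [1, 0]]
h1 = gcn_layer(edge, [[1.0], [1.0]], [[1.0]], [0.0])
h2 = gcn_layer(edge, [[1.0], [1.0]], [[1.0]], [0.0])
print(similarity_matrix(h1, h2))
```

The resulting matrix has one row per node of the first graph and one column per node of the second, and a CNN would treat it as a single-channel image.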
SimGNN. In [12], a SimGNN model is introduced based on the GSimCNN from [13]. In addition to pairwise node comparison with node-level embeddings from the GCN output, neural tensor networks (NTN) [88] are utilized to model the relation between the graph-level embeddings of the two input graphs. The graph embedding of each graph is generated via a weighted sum of node embeddings, where a global context-aware attention is applied to each node such that nodes similar to the global context receive higher attention weights. Finally, the comparisons at both the node level and the graph level are considered for the similarity score prediction in the fully connected layers.
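The context-aware attention pooling can be sketched as below. The specific choices are assumptions made for illustration: the global context is a `tanh` of the linearly transformed mean node embedding, and the attention weight of a node is the sigmoid of its inner product with that context. The weight matrix would be learned in practice; here it is fixed by hand, and the function name is hypothetical.

```python
import math


def attention_pool(node_embs, weight):
    """Graph embedding as an attention-weighted sum of node embeddings."""
    n, d = len(node_embs), len(node_embs[0])
    mean = [sum(u[p] for u in node_embs) / n for p in range(d)]
    # Global context: non-linear transform of the mean node embedding.
    context = [math.tanh(sum(mean[p] * weight[p][q] for p in range(d))) for q in range(d)]
    graph_emb = [0.0] * d
    for u in node_embs:
        # Nodes whose embedding aligns with the context get weights closer to 1.
        score = sum(x * c for x, c in zip(u, context))
        a = 1.0 / (1.0 + math.exp(-score))
        graph_emb = [g + a * x for g, x in zip(graph_emb, u)]
    return graph_emb


identity = [[1.0, 0.0], [0.0, 1.0]]
print(attention_pool([[1.0, 0.0], [1.0, 0.0]], identity))
```

Unlike plain sum-pooling, the weights depend on the whole graph through the context vector, so the same node can contribute differently in different graphs.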

Siamese GNN models for Graph Similarity Learning
This category of works uses the Siamese network architecture with GNNs as twin networks to simultaneously learn representations from two graphs, and then obtain a similarity estimate based on the output representations of the GNNs. Fig. 2 shows an example of Siamese architecture with GCNs in the twin networks, where the weights of the networks are shared with each other. The similarity estimate is typically leveraged in a loss function for training the network.
Siamese GCN. The work in [55] proposes to learn a graph similarity metric using a Siamese graph convolutional neural network (S-GCN) in a supervised setting. The S-GCN takes a pair of graphs as input and employs a spectral GCN to obtain a graph embedding for each input graph, after which a dot product layer followed by a fully connected layer is used to produce the similarity estimate between the two graphs in the spectral domain.
Higher-order Siamese GCN. Higher-order Siamese GCN (HS-GCN) is proposed in [65], which incorporates higher-order node-level proximity into graph convolutional networks so as to perform higher-order convolutions on each of the input graphs for the graph similarity learning task. A Siamese framework is employed with the proposed higher-order GCN in each of the twin networks. Specifically, random walks are used for capturing higher-order proximity from graphs and refining the graph representations used in graph convolutions. Both this work and the S-GCN [55] introduced above use the hinge loss for training the Siamese similarity learning models:
$$L_{\mathrm{hinge}} = \frac{1}{K} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \max(0, 1 - y_{ij} s_{ij}), \tag{9}$$
where $N$ is the total number of graphs in the training set, $K = N(N-1)/2$ is the total number of pairs from the training set, $y_{ij}$ is the ground-truth label for the pair of graphs $G_i$ and $G_j$, with $y_{ij} = 1$ for similar pairs and $y_{ij} = -1$ for dissimilar pairs, and $s_{ij}$ is the similarity score estimated by the model. More general forms of higher-order information (e.g., motifs [5,6]) have been used for learning graph representations [81] and would likely benefit the learning.
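A pairwise hinge loss of this kind can be sketched as follows, assuming the standard form $\max(0, 1 - y_{ij} s_{ij})$ averaged over all $K = N(N-1)/2$ unordered pairs. The scores are hand-written numbers standing in for model outputs, and the function name is hypothetical.

```python
def hinge_loss(scores, labels):
    """Average of max(0, 1 - y_ij * s_ij) over all unordered pairs i < j."""
    n = len(scores)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += max(0.0, 1.0 - labels[i][j] * scores[i][j])
            pairs += 1
    return total / pairs


# Only the upper triangle (i < j) of each matrix is read.
scores = [[0.0, 2.0, -0.5],
          [0.0, 0.0, 0.5],
          [0.0, 0.0, 0.0]]
labels = [[0, 1, -1],
          [0, 0, -1],
          [0, 0, 0]]
print(hinge_loss(scores, labels))
```

A pair stops contributing once its score agrees with its label by a margin of at least 1, as for the first pair in this example.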
Community-preserving Siamese GCN. In [63], another Siamese GCN based model called SCP-GCN is proposed for the similarity learning in functional and structural joint analysis of brain networks, where the graph structure used in the GCN is defined from the structural connectivity network while the node features come from the functional brain network. The contrastive loss (Equation (10)) along with a newly proposed community-preserving loss (Equation (11)) is used for training the model.
$$L_{\mathrm{contrastive}} = \frac{y_{ij}}{2} \|g_i - g_j\|_2^2 + \frac{1 - y_{ij}}{2} \left( \max(0, m - \|g_i - g_j\|_2) \right)^2, \tag{10}$$
where $g_i$ and $g_j$ are the graph embeddings of graph $G_i$ and graph $G_j$ computed by the GCN, and $m > 0$ is a margin value. Here $y_{ij} = 1$ if $G_i$ and $G_j$ are from the same class and $y_{ij} = 0$ if they are from different classes. By minimizing the contrastive loss, the Euclidean distance between two graph embedding vectors is reduced when the two graphs are from the same class, and increased (up to the margin $m$) when they belong to different classes. The community-preserving loss is defined as follows.
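$$L_{\mathrm{CP}} = \alpha \sum_{c} \frac{1}{|S_c|} \sum_{i \in S_c} \|z_i - \hat{z}_c\|_2^2 - \beta \sum_{c, c'} \|\hat{z}_c - \hat{z}_{c'}\|_2^2, \tag{11}$$
where $S_c$ contains the indexes of the nodes belonging to community $c$, $\hat{z}_c = \frac{1}{|S_c|} \sum_{i \in S_c} z_i$ is the community center embedding of community $c$, $z_i$ is the embedding of the $i$-th node (i.e., the $i$-th row of the node embedding matrix $Z$ output by the GCN), and $\alpha$ and $\beta$ are the weights balancing the intra-community and inter-community losses.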
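The contrastive loss used for the graph-level embeddings can be sketched for a single pair as below. The sketch assumes the standard margin-based form: half the squared Euclidean distance for same-class pairs, and half the squared amount by which the distance falls short of the margin for different-class pairs. The function name is hypothetical.

```python
import math


def contrastive_loss(g_i, g_j, y_ij, margin):
    """Contrastive loss of one pair; y_ij is 1 for the same class and 0 otherwise."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_i, g_j)))
    if y_ij == 1:
        return 0.5 * dist ** 2                     # pull same-class pairs together
    return 0.5 * max(0.0, margin - dist) ** 2      # push others apart up to the margin


g_a, g_b = [0.0, 0.0], [3.0, 4.0]                  # Euclidean distance 5
print(contrastive_loss(g_a, g_b, 1, margin=6.0))
print(contrastive_loss(g_a, g_b, 0, margin=6.0))
print(contrastive_loss(g_a, g_b, 0, margin=2.0))
```

The last call returns zero: once a different-class pair is farther apart than the margin, it no longer influences training.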
Hierarchical Siamese GNN. In [101], a Siamese network with two hierarchical GNNs is introduced for the similarity learning of heterogeneous graphs for unknown malware detection. Specifically, they consider the path-relevant sets of neighbors according to meta-paths and generate node embeddings by selectively aggregating the entities in each path-relevant neighbor set. The loss function in Equation (2) is used for training the model.
Siamese GCN for Image Retrieval. In [24], Siamese GCNs are used for content based remote sensing image retrieval, where each image is converted to a region adjacency graph in which each node represents a region segmented from the image. The goal is to learn an embedding space that pulls semantically coherent images closer while pushing dissimilar samples far apart. Contrastive loss is used in the model training.
Since the twin GNNs in the Siamese network share the same weights, an advantage of the Siamese GNN models is that the two input graphs are guaranteed to be processed in the same manner by the networks. As such, similar input graphs would be embedded similarly in the latent space. Therefore, the Siamese GNNs are good for differentiating the two input graphs in the latent space or measuring the similarity between them.
In addition to choosing the appropriate GNN models in the twin networks, one needs to choose a proper loss function. Another widely used loss function for Siamese networks is the triplet loss [85]. For a triplet $(G_i, G_p, G_n)$, $G_p$ is from the same class as $G_i$, while $G_n$ is from a different class than $G_i$. The triplet loss is defined as follows.
$$L_{\mathrm{triplet}} = \frac{1}{K} \sum_{K} \max(d_{ip} - d_{in} + m, 0), \tag{12}$$
where $K$ is the number of triplets used in the training, $d_{ip}$ represents the distance between $G_i$ and $G_p$, $d_{in}$ represents the distance between $G_i$ and $G_n$, and $m > 0$ is a margin value. By minimizing the triplet loss, the distance between graphs from the same class (i.e., $d_{ip}$) is pulled toward 0, and the distance between graphs from different classes (i.e., $d_{in}$) is pushed to be greater than $d_{ip} + m$.
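A minimal sketch of the triplet loss follows, assuming the standard form $\max(d_{ip} - d_{in} + m, 0)$ averaged over the triplets. The distances are given directly as numbers instead of being computed from graph embeddings, and the function name is hypothetical.

```python
def triplet_loss(distances, margin):
    """Mean of max(d_ip - d_in + margin, 0) over a list of (d_ip, d_in) pairs."""
    return sum(max(d_ip - d_in + margin, 0.0) for d_ip, d_in in distances) / len(distances)


# First triplet already satisfies d_in > d_ip + margin; the second does not.
print(triplet_loss([(1.0, 3.0), (2.0, 2.5)], margin=1.0))
```

Only the second triplet contributes, which illustrates that the loss constrains the relative ordering of the two distances instead of their absolute values.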
It is important to consider which loss function would be suitable for the targeted problem when applying these Siamese GNN models for the graph similarity learning task in practice.

GNN-based Graph Matching Networks
The work in this category adapts Siamese GNNs by incorporating matching mechanisms during the learning with GNNs, and cross-graph interactions are considered in the graph representation learning process.
GMN: Graph Matching Network. In [60], graph neural networks (GNNs) are trained to produce embeddings of graphs in vector spaces that enable similarity learning. The authors propose a Graph Matching Network (GMN) model, where the node update module in each propagation layer takes into account both the aggregated messages on the edges of each graph and a cross-graph matching vector, which measures how well a node in one graph can be matched to the nodes in the other graph. Given a pair of graphs as input, the GMN jointly learns graph representations for the pair through the cross-graph attention-based matching mechanism, which propagates node representations by using both the neighborhood information within the same graph and cross-graph node information. A similarity score between the two input graphs is computed in the latent vector space.
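The cross-graph matching vector can be sketched as follows. The sketch assumes dot-product similarity between node states and softmax attention over the nodes of the other graph, so that the matching vector of a node is its own state minus the attention-weighted average of the other graph's node states. The function name is hypothetical, and the within-graph message passing is omitted.

```python
import math


def cross_graph_matching(h1, h2):
    """For each node of graph 1, return its state minus its soft match in graph 2."""
    out = []
    for hi in h1:
        scores = [sum(a * b for a, b in zip(hi, hj)) for hj in h2]
        top = max(scores)                          # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        attended = [sum(w * hj[d] for w, hj in zip(attn, h2)) for d in range(len(hi))]
        out.append([a - b for a, b in zip(hi, attended)])
    return out


# A node with perfect counterparts in the other graph gets a zero matching vector.
print(cross_graph_matching([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]))
```

A matching vector near zero signals that the node has a close counterpart in the other graph, while large entries flag nodes without a good match.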
NeuralMCS: Neural Maximum Common Subgraph GMN. Based on the graph matching network in [60], [14] proposes a neural maximum common subgraph (MCS) detection approach for learning graph similarity. The graph matching network is adapted to learn node representations for two input graphs G_1 and G_2, after which a likelihood of matching each node in G_1 to each node in G_2 is computed by a normalized dot product between the node embeddings. The likelihood indicates which node pair is most likely to be in the MCS, and the likelihoods for all pairs of nodes constitute the matching matrix Y for G_1 and G_2. Then a guided subgraph extraction process is applied, which starts by finding the most likely pair and iteratively expands the extracted subgraphs by selecting one more pair at a time until adding more pairs would lead to non-isomorphic subgraphs. To check the subgraph isomorphism, subgraph-level embeddings are computed by aggregating the node embeddings of the neighboring nodes that are included in the MCS, and the Euclidean distance between the subgraph embeddings is computed. Finally, a similarity/match score is obtained based on the subgraphs extracted from G_1 and G_2.
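The first step of this pipeline, building the matching matrix Y and picking the seed pair, might be sketched as follows. The text does not specify the normalization, so this sketch assumes a softmax over all node pairs; that choice, the function names, and the toy embeddings are assumptions. The guided extraction and isomorphism check are omitted.

```python
import numpy as np


def matching_matrix(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Matching likelihoods Y of shape (n1, n2) from node embeddings.

    Assumption: dot products are normalized with a softmax over all node pairs.
    """
    scores = x1 @ x2.T
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()


def most_likely_pair(y: np.ndarray) -> tuple:
    """Index pair (i, j) with the highest matching likelihood."""
    i, j = np.unravel_index(np.argmax(y), y.shape)
    return int(i), int(j)


# Toy node embeddings for G_1 (3 nodes) and G_2 (2 nodes), illustrative only.
x1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x2 = np.array([[0.0, 2.0], [3.0, 0.5]])
Y = matching_matrix(x1, x2)
seed_pair = most_likely_pair(Y)  # starting point for the guided subgraph extraction
```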
Hierarchical Graph Matching Network. In [62], a hierarchical graph matching network is proposed for graph similarity learning, which consists of a Siamese GNN for learning global-level interactions between two graphs and a multi-perspective node-graph matching network for learning the cross-level node-graph interactions between parts of one graph and one whole graph. Given two graphs G_1 and G_2 as inputs, a three-layer GCN is utilized to generate embeddings for them, and aggregation layers are added to generate the graph embedding vector for each graph. In particular, cross-graph attention coefficients are calculated between each node in G_1 and all the nodes in G_2, and between each node in G_2 and all the nodes in G_1. Then the attentive graph-level embeddings are generated using the weighted average of node embeddings of the other graph, and a multi-perspective matching function is defined to compare the node embeddings of one graph with the attentive graph-level embeddings of the other graph. Finally, the BiLSTM model [86] is used to aggregate the cross-level interaction feature matrix from the node-graph matching layer, followed by the final prediction layers for the similarity score learning.
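The node-graph matching step can be sketched as follows for one direction (nodes of G_1 against the whole of G_2). The sketch assumes cosine attention coefficients that are softmax-normalized so the attentive embedding is a true weighted average, and a multi-perspective function that takes the cosine similarity after element-wise reweighting by one vector per perspective. These details and all names are assumptions for illustration rather than the exact design of [62].

```python
import numpy as np


def cosine_rows(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a (n1, d) and b (n2, d)."""
    a_unit = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a_unit @ b_unit.T


def node_graph_matching(h1: np.ndarray, h2: np.ndarray,
                        perspectives: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Compare each node of graph 1 with an attentive summary of graph 2.

    h1: (n1, d), h2: (n2, d), perspectives: (P, d). Returns features of shape (n1, P).
    """
    alpha = cosine_rows(h1, h2)                            # cross-graph attention coefficients
    weights = np.exp(alpha)
    weights /= weights.sum(axis=1, keepdims=True)          # normalize: each row sums to 1
    attentive = weights @ h2                               # (n1, d) attentive graph-level embeddings
    a = h1[:, None, :] * perspectives[None, :, :]          # (n1, P, d) reweighted node embeddings
    b = attentive[:, None, :] * perspectives[None, :, :]   # (n1, P, d) reweighted summaries
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den                                       # cosine similarity per perspective


# Random toy inputs (illustrative only).
rng = np.random.default_rng(0)
h1 = rng.standard_normal((4, 3))
h2 = rng.standard_normal((5, 3))
perspectives = rng.standard_normal((2, 3))
features = node_graph_matching(h1, h2, perspectives)
```

The resulting feature matrix has one row per node of G_1 and is the kind of cross-level interaction matrix that a sequence model could then aggregate.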
NGMN: Neural Graph Matching Network. In [43], a Neural Graph Matching Network (NGMN) is proposed for few-shot 3D action recognition, where 3D data are represented as interaction graphs. A GCN is applied for updating node features in the graphs and an MLP is employed for updating the edge strength. A graph matching metric is then defined based on both node matching features and edge matching features. In the proposed NGMN, edge generation and the graph matching metric are learned jointly for the few-shot learning task.
Recently, deep graph matching networks were introduced for the graph matching problem in image matching [31,112,49,100]. Graph matching aims to find node correspondences between graphs such that the affinity of the corresponding nodes and edges is maximized. Although graph matching differs from the graph similarity learning problem we focus on and is beyond the scope of this survey, some work on deep graph matching networks involves graph similarity learning. We therefore review some of this work below to provide insights into how deep similarity learning may be leveraged for graph matching applications, such as image matching.
GMNs for Image Matching. In [49], a Graph Learning-Matching Network is proposed for image matching. A CNN is first utilized to extract feature descriptors of all feature points for the input images, and graphs are then constructed based on the features. Then the GCNs are used for learning node embeddings from the graphs, in which both intra-graph convolutions and cross-graph convolutions are conducted. The final matching prediction is formulated as node-to-node affinity metric learning in the embedding space, and the constraint regularized loss along with cross-entropy loss is used for the metric learning and the matching prediction. In [100], another GNN-based graph matching network is proposed for the image matching problem, which consists of a CNN image feature extractor, a GNN-based graph embedding component, an affinity metric function and a permutation prediction component, as an end-to-end learnable framework. Specifically, GCNs are used to learn node-wise embeddings for intra-graph affinity, where a cross-graph aggregation step is introduced to aggregate features of nodes in the other graph for incorporating cross-graph affinity into the node embeddings. The node embeddings are then used for building an affinity matrix which contains the similarity scores at the node level between two graphs, and the affinity matrix is further used for the matching prediction. The cross-entropy loss is used to train the model end-to-end.

Deep Graph Kernels
Graph kernels have become a standard tool for capturing the similarity between graphs for tasks such as graph classification [96]. Given a collection of graphs, possibly with node or edge attributes, the work on graph kernels aims to learn a kernel function that can capture the similarity between any two graphs. Traditional graph kernels, such as random walk kernels, subtree kernels, and shortest-path kernels, have been widely used in the graph classification task [79]. Recently, deep graph kernel models have also emerged, which build kernels based on the graph representations learned via deep neural networks.
Deep Graph Kernels. In [108], a Deep Graph Kernel approach is proposed. For a given set of graphs, each graph is decomposed into its sub-structures. Then sub-structures are viewed as words and neural language models in the form of CBOW (continuous bag-of-words) and Skip-gram are used to learn latent representations of sub-structures from the graphs, where corpora are generated for the Shortest-path graph and Weisfeiler-Lehman kernels in order to measure the co-occurrence relationship between substructures. Finally, the kernel between two graphs is defined based on the similarity of the substructure space.
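One way to read the final step is as a kernel over sub-structure count vectors, weighted by a similarity matrix between sub-structures. The sketch below assumes that form, k(G, G') = φ(G)ᵀ M φ(G'), with M built from inner products of learned sub-structure embeddings. The function name, the count vectors, and the embeddings are illustrative assumptions, and the embedding learning itself (CBOW or Skip-gram) is not shown.

```python
import numpy as np


def deep_graph_kernel(phi_a: np.ndarray, phi_b: np.ndarray, sub_emb: np.ndarray) -> float:
    """Kernel value phi_a^T M phi_b, where M holds sub-structure similarities.

    phi_a, phi_b: (S,) sub-structure counts of the two graphs.
    sub_emb: (S, d) embedding of each sub-structure (assumed to be learned elsewhere).
    """
    m = sub_emb @ sub_emb.T  # (S, S) similarity between sub-structures
    return float(phi_a @ m @ phi_b)


# Toy counts over S = 3 sub-structures (illustrative only).
phi_a = np.array([2.0, 0.0, 1.0])
phi_b = np.array([0.0, 2.0, 1.0])

# With one-hot embeddings, M is the identity and only exact co-occurrences count.
k_plain = deep_graph_kernel(phi_a, phi_b, np.eye(3))

# If sub-structures 0 and 1 share an embedding, they also contribute to the kernel.
shared = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
k_deep = deep_graph_kernel(phi_a, phi_b, shared)
```

The second call shows the intended effect: two graphs built from different but related sub-structures receive a higher kernel value than plain count matching would give.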
Deep Divergence Graph Kernels. In [8], a model called Deep Divergence Graph Kernels (DDGK) is introduced to learn kernel functions for graph pairs. Given two graphs G_1 and G_2, the aim is to learn an embedding-based kernel function k() as a similarity metric for graph pairs, defined as:
$$k(G_1, G_2) = \lVert \Psi(G_1) - \Psi(G_2) \rVert^2$$
where Ψ(G_i) is a representation learned for G_i. This work proposes to learn graph representations by measuring the divergence of the target graph across a population of source graph encoders. Given a source graph collection {G_1, G_2, ..., G_n}, a graph encoder is first trained to learn the structure of each graph in the source collection. Then, for a target graph G_T, the divergence of G_T from each source graph is measured, after which the divergence scores are used to compose the vector representation of the target graph G_T. Fig. 3 illustrates the above graph representation learning process. Specifically, the divergence score between a target graph G_T = (V_T, E_T) and a source graph G_S = (V_S, E_S) is computed as follows:
$$D'(G_T \,\|\, G_S) = \sum_{v_i \in V_T} \; \sum_{j:\, e_{ji} \in E_T} -\log \Pr(v_j \mid v_i, H_S)$$
where H_S is the encoder trained on graph S.
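The last step, turning divergence scores into a kernel value, can be sketched as follows. The sketch assumes the divergence scores against each source encoder have already been computed (encoder training is omitted) and that the kernel is the squared Euclidean distance between the resulting representations, so smaller values indicate more similar graphs. The function name and the numbers are illustrative assumptions.

```python
import numpy as np


def ddgk_kernel(psi_a: np.ndarray, psi_b: np.ndarray) -> float:
    """Squared Euclidean distance between two divergence-based representations."""
    diff = np.asarray(psi_a, dtype=float) - np.asarray(psi_b, dtype=float)
    return float(diff @ diff)


# Row t holds the divergence of target graph t from each of 3 source graphs.
# These scores are made up for illustration; in DDGK they come from trained encoders.
divergence = np.array([
    [0.1, 2.0, 3.0],
    [0.2, 1.9, 3.1],
    [2.5, 0.3, 1.0],
])

k_01 = ddgk_kernel(divergence[0], divergence[1])  # targets with similar divergence profiles
k_02 = ddgk_kernel(divergence[0], divergence[2])  # targets with different profiles
```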
Graph Neural Tangent Kernel. In [30], a Graph Neural Tangent Kernel (GNTK) is proposed for fusing GNNs with the neural tangent kernel, which was originally formulated for fully-connected neural networks in [48] and later introduced to CNNs in [9]. Given a pair of graphs ⟨G, G'⟩, they first apply GNNs on the graphs. Let f(θ, G) ∈ R be the output of the GNN under parameters θ ∈ R^m on input graph G. To get the corresponding GNTK value, they calculate the expected value of $\left\langle \frac{\partial f(\theta, G)}{\partial \theta}, \frac{\partial f(\theta, G')}{\partial \theta} \right\rangle$ in the limit that m → ∞ and θ are all Gaussian random variables.
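To make the gradient inner product concrete, the sketch below evaluates it at a single random Gaussian initialization of a finite-width toy model. This is an empirical tangent kernel, not the closed-form GNTK of [30]. The toy model, f(θ, G) = a · Σ_v relu(W h_v) / sqrt(m) with h_v summing a node's features and those of its neighbors, is an assumption chosen so that the gradients can be written by hand.

```python
import numpy as np


def aggregate(adj: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Sum each node's features with those of its neighbours."""
    return (adj + np.eye(adj.shape[0])) @ x


def grad_f(adj: np.ndarray, x: np.ndarray, w: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Flattened gradient of f(theta, G) = a . sum_v relu(W h_v) / sqrt(m) w.r.t. (W, a)."""
    m = w.shape[0]
    h = aggregate(adj, x)                                    # (n, d)
    pre = h @ w.T                                            # (n, m) pre-activations
    grad_a = np.maximum(pre, 0.0).sum(axis=0) / np.sqrt(m)   # (m,)
    grad_w = ((pre > 0) * a).T @ h / np.sqrt(m)              # (m, d)
    return np.concatenate([grad_w.ravel(), grad_a])


def empirical_tangent_kernel(g1, g2, width: int, seed: int = 0) -> float:
    """Inner product of parameter gradients at one Gaussian initialization."""
    rng = np.random.default_rng(seed)
    d = g1[1].shape[1]
    w = rng.standard_normal((width, d))
    a = rng.standard_normal(width)
    return float(grad_f(*g1, w, a) @ grad_f(*g2, w, a))


# Two toy graphs on three nodes with shared node features (illustrative only).
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

k_tp = empirical_tangent_kernel((triangle, feats), (path, feats), width=2048)
k_pt = empirical_tangent_kernel((path, feats), (triangle, feats), width=2048)
k_tt = empirical_tangent_kernel((triangle, feats), (triangle, feats), width=2048)
k_pp = empirical_tangent_kernel((path, feats), (path, feats), width=2048)
```

Increasing `width` makes this single-sample estimate less noisy; the GNTK itself corresponds to the infinite-width expectation.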
Meanwhile, some deep graph kernels have also been proposed for node representation learning on graphs, targeting node classification and node similarity learning. For instance, in [91], a learnable kernel-based framework is proposed for node classification, where the kernel function is decoupled into a feature mapping function and a base kernel. An encoder-decoder function is introduced to project each node into the embedding space and reconstruct pairwise similarity measurements from the node embeddings. Since we focus on the similarity learning between graphs in this survey, we will not discuss this work further.

Applications
Graph similarity learning is a fundamental problem in domains where data are represented as graph structures, and it has various applications in the real world.

Computational Chemistry and Biology
An important application of graph similarity learning in the chemistry and biology domain is to learn chemical similarity, which aims to learn the similarity of chemical elements, molecules or chemical compounds with respect to their effect on reaction partners in inorganic or biological settings [19]. An example is compound querying for in-silico drug screening, where searching a database for similar compounds is the key step.
In the literature of graph similarity learning, quite a number of models have been proposed and applied to similarity learning for chemical compounds or molecules. Among these works, the traditional models mainly employ subgraph-based search strategies or graph kernels to solve the problem [116,113,89,69]. However, these methods tend to have high computational complexity and rely strongly on the subgraphs or kernels defined, making it difficult to use them in real applications. Recently, a deep graph similarity learning model, SimGNN, was proposed in [12], which also aims to learn similarity for chemical compounds as one of its tasks. Instead of using subgraphs or other explicit features, the model adopts GCNs to learn node-level embeddings, which are fed into an attention module after multiple layers of GCNs to generate the graph-level embeddings. Then a neural tensor network (NTN) [88] is used to model the relation between two graph-level embeddings, and the output of the NTN is used together with the pairwise node embedding comparison output in the fully connected layers for predicting the graph edit distance between the two graphs. This work has shown that the proposed deep learning model outperforms the traditional methods for graph edit distance computation in prediction accuracy while requiring much less running time, which indicates the promise of deep graph similarity learning models in chemo-informatics and bio-informatics.
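The attention pooling and NTN steps described above might be sketched as follows. The sketch assumes node embeddings are already available from the GCN layers, that attention weights are a sigmoid of the inner product with a global context vector, and that the NTN combines a bilinear tensor term with a linear term on the concatenated embeddings. All weights are random placeholders, and the names and shapes are assumptions for illustration rather than the exact SimGNN implementation.

```python
import numpy as np


def attention_pool(u: np.ndarray, w_ctx: np.ndarray) -> np.ndarray:
    """Graph-level embedding as an attention-weighted sum of node embeddings u (n, d)."""
    context = np.tanh(u.mean(axis=0) @ w_ctx)       # (d,) global context vector
    weights = 1.0 / (1.0 + np.exp(-(u @ context)))  # (n,) one attention weight per node
    return weights @ u                              # (d,)


def ntn(h1: np.ndarray, h2: np.ndarray, w_tensor: np.ndarray,
        v: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Neural tensor network scores of shape (K,) relating two graph embeddings."""
    bilinear = np.einsum("i,kij,j->k", h1, w_tensor, h2)  # one bilinear form per slice
    linear = v @ np.concatenate([h1, h2])                 # linear term on [h1; h2]
    return np.maximum(bilinear + linear + b, 0.0)         # ReLU activation


# Random placeholders standing in for GCN outputs and learned weights.
rng = np.random.default_rng(0)
d, k = 4, 2
u1 = rng.standard_normal((5, d))  # node embeddings of graph 1
u2 = rng.standard_normal((3, d))  # node embeddings of graph 2
w_ctx = rng.standard_normal((d, d))
w_tensor = rng.standard_normal((k, d, d))
v = rng.standard_normal((k, 2 * d))
b = rng.standard_normal(k)

h1 = attention_pool(u1, w_ctx)
h2 = attention_pool(u2, w_ctx)
scores = ntn(h1, h2, w_tensor, v, b)  # input to the fully connected prediction layers
```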

Neuroscience
Many neuroscience studies have shown that structural and functional connectivity of the human brain reflects the brain activity patterns that could be indicators of brain health status or cognitive ability level [11,67,68]. For example, the functional brain connectivity networks derived from fMRI neuroimaging data can reflect the functional activity across different brain regions, and people with brain disorders such as Alzheimer's disease or bipolar disorder tend to have functional activity patterns that differ from those of healthy people [11,90,66]. To investigate the difference in brain connectivity patterns for these neuroscience problems, researchers have started to study the similarity of brain networks among multiple subjects with graph similarity learning methods [55,65].
The organization of functional brain networks is complicated and usually constrained by various factors, such as the underlying brain anatomical network, which plays an important role in shaping the activity across the brain. These constraints make it a challenging task to characterize the structure and organization of brain networks while performing similarity learning on them. Recent work in [55], [65] and [63] has shown that deep graph models based on graph convolutional networks have a superior ability to capture brain connectivity features for similarity analysis compared to the traditional graph embedding based approaches. In particular, [65] proposes a higher-order Siamese GCN framework that leverages the higher-order connectivity structure of functional brain networks for similarity learning in multi-subject analysis with respect to brain health status and cognitive abilities. Its superior performance on a number of real fMRI datasets suggests the value of such graph similarity learning models in clinical investigation of brain diseases and in neuroscience applications.

Computer Security
In the field of computer security, graph similarity has also been studied for various application scenarios, such as the hardware security problem [71], the malware indexing problem based on function-call graphs [46], and the binary function similarity search for identifying vulnerable functions [60].
In [71], a graph similarity heuristic is proposed based on spectral analysis of adjacency matrices for the hardware security problem, where evaluations are done for three tasks, including gate-level netlist reverse engineering, Trojan detection, and obfuscation assessment. The proposed method outperforms the graph edit distance approximation algorithm proposed in [46] and the neighbor matching approach [97], which matches neighboring vertices based on graph topology. [60] is the work that introduced GNN-based deep graph similarity learning models to the security field to solve the binary function similarity search problem. Compared to previous models, the proposed deep model computes similarity scores jointly on pairs of graphs rather than first independently mapping each graph to a vector, and the node representation update process uses an attention-based module which considers both within-graph and cross-graph information. Empirical evaluations demonstrate the superior performance of the proposed deep graph matching networks compared to Google's open-source function similarity search tool [1], the basic GNN models, and the Siamese GNNs.

Computer Vision
Graph similarity learning has also been explored for applications in computer vision. In [103], context-dependent graph kernels are proposed to measure the similarity between graphs for human action recognition in video sequences. Two directed and attributed graphs are constructed to describe the local features with intra-frame relationships and inter-frame relationships, respectively. The graphs are decomposed into a number of primary walk groups with different walk lengths, and a generalized multiple kernel learning algorithm is applied to combine all the context-dependent graph kernels, which further facilitates human action classification. In [43], a deep model called Neural Graph Matching Network is first introduced for the 3D action recognition problem in the few-shot learning setting. Interaction graphs are constructed from the 3D scenes, where the nodes represent physical entities in the scene and edges represent interactions between the entities. The proposed NGM Networks jointly learn a graph generator and a graph matching metric function in an end-to-end fashion to directly optimize the few-shot learning objective. It has been shown to significantly improve the few-shot 3D action recognition over the holistic baselines. This work demonstrates the promising applications of deep graph similarity learning models for the practical learning tasks in computer vision, where a key problem is to first convert the image or video data to graphs.

Various Graph Types
In most of the work discussed above, the graphs involved consist of unlabeled nodes/edges and undirected edges. However, there are many variants of graphs in real world applications. How to build deep graph similarity learning models for these various graph types is a challenging problem.
Directed Graphs. In some application scenarios, the graphs are directed, which means all the edges in the graph are directed from one vertex to another. For instance, in a knowledge graph, edges go from one entity to another, where the relationship is directed. In such cases, we should treat the information propagation process differently according to the direction of the edge. Recently some GCN based graph models have suggested some strategies for dealing with such directed graphs. In [52], a dense graph propagation strategy is proposed for the propagation on knowledge graphs, where two kinds of weight matrices are introduced for the propagation based on a node's relationship to its ancestors and descendants respectively. However, to the best of our knowledge, no work has been done on deep similarity learning specifically for directed graphs, which arises as a challenging problem for this community.
Labeled Graphs. Labeled graphs are graphs where vertices or edges have labels. For example, in chemical compound graphs where vertices denote the atoms and the edges represent the chemical bonds between the atoms, each node and edge has a label representing the atom type and bond type, respectively. These labels are important for characterizing the node-node relationships in the graphs, so it is important to leverage this label information for similarity learning. In [12,7], the node label information is used as the initial node representation, encoded as a one-hot vector and used in the node embedding stage. In this case, nodes of the same type share the same one-hot encoding vector. This should guarantee that even if the node ids are permuted, the aggregation results would be the same. However, the label information is only used for the node embedding process within each graph, and the comparison of the node or edge labels across graphs is not considered during the similarity learning stage. How to better incorporate the label information into the similarity learning process is a critical problem.
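The permutation argument can be checked with a small sketch. It assumes one round of sum aggregation over neighbors followed by a sum readout, which is a simplification of the node embedding stage in [12,7]. The molecule (a carbon bonded to one oxygen and two hydrogens, with bond types ignored) and the function names are illustrative assumptions.

```python
import numpy as np


def one_hot(labels, vocab) -> np.ndarray:
    """Encode node labels as one-hot rows; nodes of the same type share a vector."""
    index = {lab: k for k, lab in enumerate(vocab)}
    x = np.zeros((len(labels), len(vocab)))
    for row, lab in enumerate(labels):
        x[row, index[lab]] = 1.0
    return x


def graph_embedding(adj: np.ndarray, x: np.ndarray) -> np.ndarray:
    """One round of sum aggregation over neighbours, then a sum readout."""
    h = (adj + np.eye(len(adj))) @ x
    return h.sum(axis=0)


vocab = ["C", "O", "H"]
labels = ["C", "O", "H", "H"]
adj = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Relabel the node ids and permute the adjacency matrix consistently.
perm = [2, 0, 3, 1]
labels_perm = [labels[i] for i in perm]
adj_perm = adj[np.ix_(perm, perm)]

emb = graph_embedding(adj, one_hot(labels, vocab))
emb_perm = graph_embedding(adj_perm, one_hot(labels_perm, vocab))
```

Both embeddings coincide because the encoding depends only on the label of each node and the aggregation and readout are sums, which do not depend on node order.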
Dynamic and Streaming Graphs. Another kind of graph is the dynamic graph, which has a static graph structure and dynamic input signals/features. For example, 3D human action or motion data can be represented as graphs where the entities are represented as nodes and the actions as edges connecting the entities. Similarity learning on these graphs is then an important problem for action and motion recognition. Another type of graph is the streaming graph, where the structure and/or features change dynamically, for example, online social networks [3,2,4]. Similarity learning on such graphs would be important for change/anomaly detection, link prediction, etc. Although some work has proposed variants of GCN models for spatio-temporal graphs [111,70], and other learning methods for dynamic graphs [77,76,93,59], the similarity learning problem on dynamic and streaming graphs has not been well studied. The main challenge in this problem is how to leverage the temporal updates of the node-level representations and the interactions between the nodes on these graphs while modeling their similarity.

Interpretability
Deep graph models, such as GNNs, combine node feature information with graph structure by recursively passing neural messages along the edges of the graph, which is a complex process and makes it challenging to explain the learning results from these models. Recently, some work has started to explore the interpretability of GNNs [109,15]. In [109], GNNEXPLAINER is proposed for providing interpretable explanations for predictions of GNN-based models. It first identifies a subgraph structure and a subset of node features that are crucial in a prediction. Then it formulates an optimization task that maximizes the mutual information between a GNN's prediction and the distribution of possible subgraph structures. [15] explores the explainability of GNNs using gradient-based and decomposition-based methods, respectively, on a toy dataset and a chemistry task. Although these works have provided some insights into the interpretability of GNNs, they are mainly for node classification or link prediction tasks on a graph. To the best of our knowledge, the explainability of GNN-based graph similarity models remains unexplored.

Few-shot Learning
The task of few-shot learning is to learn classifiers for new classes with only a few training examples per class. A large branch of work in this area is based on metric learning [102]. However, most of the existing work addresses few-shot learning problems on images, such as image recognition [54] and image retrieval [94]. Little work has been done on metric learning for few-shot learning on graphs, which is an important problem for areas in which data are represented as graphs and data gathering is difficult, for example, brain connectivity network analysis in neuroscience. Since graph data usually has complex structure, how to learn a metric so that it can facilitate generalizing from a few graph examples is a big challenge. Some recent work [43] has begun to explore the few-shot 3D action recognition problem with graph-based similarity learning strategies, where a neural graph matching network is proposed to jointly learn a graph generator and a graph matching metric function to optimize the few-shot learning objective of 3D action recognition. However, since the objective is defined specifically for the 3D action recognition task, the model cannot be directly used in other domains. The remaining problem is to design general deep graph similarity learning models for the few-shot learning task for a multitude of applications.

Conclusion
In this survey, we provided a comprehensive review of the existing work on deep graph similarity learning, categorized into three main categories: graph embedding based graph similarity learning models, GNN-based models, and deep graph kernels. We also summarized the different properties and existing applications of these models. Finally, we shed light on the key challenges and future directions for the deep graph similarity learning problem.