# High efficiency and quality: large graphs matching


DOI: 10.1007/s00778-012-0292-8

- Cite this article as:
- Zhu, Y., Qin, L., Yu, J.X. et al. The VLDB Journal (2013) 22: 345. doi:10.1007/s00778-012-0292-8


## Abstract

Graph matching plays an essential role in many real applications. In this paper, we study how to match two large graphs by maximizing the number of matched edges, which is known as maximum common subgraph matching and is NP-hard. Exact algorithms cannot handle graphs with more than about 30 nodes, while existing approximate algorithms can produce matchings of very poor quality. We propose a novel two-step approach that can efficiently match two large graphs with thousands of nodes at high matching quality. In the first step, we propose an anchor-selection/expansion approach to compute a good initial matching. In the second step, we propose a new approach to refine the initial matching. We show the optimality of our refinement and discuss how to randomly refine the matching with different combinations. We further show how to extend our solution to handle labeled graphs. We conducted extensive testing using real and synthetic datasets and report our findings in this paper.

### Keywords

Graph matching · Maximum common subgraph · Vertex cover

## 1 Introduction

Graphs proliferate in a wide variety of applications, including social networks in psycho-sociology, attributed graphs in image processing, food chains in ecology, electrical circuits in electrical engineering, road networks in transport, protein interaction networks in biology, and topological networks on the Web. Graph processing has attracted great attention from both the research and industrial communities.

Graph matching is an important type of graph processing, which aims at finding correspondences between the nodes/edges of two graphs to ensure that some substructures in one graph are mapped to similar substructures in the other. Graph matching plays an essential role in a large number of concrete applications [13].

**Biology:** Protein–protein interaction (PPI) networks play an important role in most biological processes, in which a node corresponds to a protein and an edge indicates the interaction between two proteins. Comparative analysis of PPI networks across species provides insightful views of similarities and differences between species at systemic level, and helps to identify conserved functional components across species. Graph matching can be effectively used for such PPI networks comparisons, to maximally identify the pairs of homologous proteins from two different organisms such that PPIs are conserved between matched pairs [28, 37].

**Biochemistry:** The genome of an organism is represented as a graph with genes as nodes and binary relations between genes as edges, and the metabolic pathway is represented as another graph with enzymes as nodes and chemical compounds as edges. These two graphs are then matched to identify FRECs (functionally related enzyme clusters) that reveal important biological features of the organisms [24].

**Medicine:** The electroencephalogram (EEG) signal can be transformed into a graph with the extracted energy bursts as nodes. Graph matching is applied to the comparison of two EEG signals to analyze different brain activities in terms of latency, frequency, energy, and activated areas [7].

**Video Indexing:** A region adjacency graph (RAG) is constructed to represent an object, where the nodes are segmented regions in video frames. Graph matching between two RAGs can be used to retrieve similar objects in video-shot collections [12].

**Schema Matching:** In data integration and service interoperability, schema matching is important, which aims at identifying correspondences between metadata structures or models. Consider a comparison shopping website that aggregates product offers from multiple independent online stores. Since each website can be modeled as a graph, graph matching can also be used to solve schema matching problems [22].

In the literature, a number of algorithms have been proposed for graph matching, including exact matching [19, 21, 31] and approximate matching [4, 10, 17, 20, 25, 27, 32, 34]. The exact approaches are able to find the optimal matching at the cost of exponential running time, while the approximate approaches are much more efficient but can produce poor matching results. More importantly, most of them can only handle small graphs with tens to hundreds of nodes. As an indication, exactly matching two undirected graphs with 30 nodes may take 100,000 s. It is important to note that real-world networks nowadays can be very large. The existing approaches cannot match graphs with even thousands of nodes both efficiently and with high quality.

In this paper, we study the problem of matching two large graphs, which is formulated as follows. Given two graphs \(G_1\) and \(G_2\), we find a one-to-one matching between the nodes in \(G_1\) and \(G_2\) such that the number of the matched edges is maximized. The optimal solution to the problem corresponds to the maximum common subgraph (*MCS*) between \(G_1\) and \(G_2\), which is an NP-hard problem and has been studied for decades. It is known to be very difficult to find a high-quality approximate matching efficiently even for small graphs. In order to meet the needs of handling large graphs for graph matching and analysis, we propose a novel approximate solution with polynomial time complexity while still attaining high matching quality.^{1}

The main contributions of this paper are summarized below. We propose a novel two-step approach, namely, matching construction and matching refinement. In the first step, matching construction, we propose a new anchor-selection/expansion approach to compute an initial matching. We give heuristics to select a small number of important anchors using a new similarity score, which measures how likely two nodes in different graphs are to be matched by taking both global and local information of the nodes into consideration. We compute a good initial matching by expanding from the selected anchors. The expansion is based on the structural similarity among the neighbors of nodes in the two graphs. In the second step, matching refinement, we propose a new approach to refine the initial matching. The novelty of our refinement is as follows. First, we refine a matching \(M\) to a better one that is most likely to exist and can be identified. Second, for efficiency, we focus on a subset of nodes to refine while giving every node in the graphs a chance to be refined. We show the optimality of our refinement. We also show how to randomly refine matchings with different combinations. Our refinement can improve the matching quality with small overhead for both unlabeled and labeled graphs. We conducted extensive testing using real and synthetic datasets, and confirmed the quality and efficiency of our approach. The average ratio of our approximate matching to the exact matching is above 90 %, while the computational cost is less than 1 % of that of the state-of-the-art exact algorithms. This is a significant step beyond the existing approximate algorithms for matching large graphs in the literature.

The rest of the paper is organized as follows. Section 2 discusses some related work. Section 3 gives the problem statement. Section 4 gives an overview of our two-step approach. Sections 5 and 6 discuss the matching construction and matching refinement. Section 7 extends our work to handle labeled graphs. Section 8 shows the performance results. Section 9 concludes this paper.

## 2 Related work

We discuss exact graph matching and approximate graph matching, according to whether the (sub)graph isomorphism problem or the maximum common subgraph problem is involved.

For exact graph matching, most algorithms in the literature use backtracking (refer to Ullmann’s algorithm for subgraph and graph isomorphism [31]). Existing solutions for finding the maximum common subgraph mainly focus on the maximum common node induced subgraph, and most techniques can hardly be used for the maximum common edge induced subgraph. Among them, McGregor [21] proposes a backtracking search method for finding the maximum common subgraph. An improved backtracking algorithm is given in [19] with time complexity \(O(m^{n+1}\cdot n)\), where \(n\) and \(m\) are the numbers of vertices of \(G_1\) and \(G_2\), respectively. Abu-Khzam et al. [1] propose an algorithm that combines backtracking and vertex cover enumeration to solve the maximum common node induced subgraph problem. There are also some other studies that calculate the maximum common node induced subgraph by finding the maximum clique in the association graph [18, 26, 29]. The complexity of the maximum clique approach is no better than backtracking.
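To make the cost of exact matching concrete, the following is a minimal brute-force sketch (our own illustration, not the algorithm of [19] or [21]): it enumerates injective partial mappings by backtracking and counts matched edges, so its running time grows exponentially with the number of nodes and it is only feasible for tiny graphs.

```python
def mcs_matched_edges(V1, E1, V2, E2):
    """Exhaustive backtracking over injective partial mappings from V(G1)
    to V(G2), returning the maximum number of matched edges (the MCSe
    objective). Exponential: for illustration on tiny graphs only."""
    V1, V2 = list(V1), list(V2)
    E1 = [tuple(e) for e in E1]
    E2set = {frozenset(e) for e in E2}   # undirected edges of G2
    best = 0

    def rec(i, mapping, used, score):
        nonlocal best
        if i == len(V1):
            best = max(best, score)
            return
        u = V1[i]
        rec(i + 1, mapping, used, score)          # option: leave u unmatched
        for v in V2:
            if v in used:
                continue
            # edges of G1 newly matched by pairing u with v
            gain = 0
            for (a, b) in E1:
                if a == u and b in mapping and frozenset((v, mapping[b])) in E2set:
                    gain += 1
                elif b == u and a in mapping and frozenset((v, mapping[a])) in E2set:
                    gain += 1
            mapping[u] = v
            used.add(v)
            rec(i + 1, mapping, used, score + gain)
            del mapping[u]
            used.remove(v)

    rec(0, {}, set(), 0)
    return best
```

Even without any pruning, this sketch makes the search-space explosion visible: each of the \(n\) nodes of \(G_1\) can map to any unused node of \(G_2\) or stay unmatched.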

For approximate graph matching, there are three categories: propagation-based method, spectral-based method, and optimization-based method.

The propagation-based method is mainly based on the intuition that two nodes are similar if their respective neighborhoods are similar. In [22], a similarity flooding approach is proposed, which starts from string-based comparison of the vertices labels to obtain an initial alignment between nodes of two graphs and refines it by an iterative fix-point computation. Blondel et al. [8] construct a similarity measure between any two nodes in any two graphs based on Kleinberg’s hub and authority idea of HITS algorithm [16]. This procedure will, in general, converge to different even and odd limits which will depend upon the initial conditions. Recently, IsoRank [28] extends the propagation-based method by adding the weight of propagation into the iteration process.

Spectral-based methods aim to represent and distinguish structural properties of graphs using the eigenvalues and eigenvectors of graph adjacency matrices. They are based on the observation that if two graphs are isomorphic, their adjacency matrices have the same eigenvalues and eigenvectors. Since eigenvalue computation takes polynomial time, it is used by many works in graph matching [4, 10, 17, 20, 25, 32, 34]. Among these works, Umeyama [32] uses the eigendecomposition of the adjacency matrices of the graphs to derive a simple expression of the orthogonal matrix that optimizes the objective function. Xu and King [35] propose a solution to the weighted isomorphism problem that combines the use of eigenvalues/eigenvectors with continuous optimization techniques. These two methods are only suitable for graphs with the same number of nodes. In [6], the authors handle graphs with different numbers of nodes, using the Laplacian eigenmaps scheme to perform a generalized eigendecomposition of the Laplacian matrix. Caelli and Kosinov [11] propose a method of projecting vertices into an eigen-subspace for graph matching, which is used for inexact many-to-many graph matching rather than one-to-one matching, and in [10] extend Umeyama’s work to match two graphs of different sizes by choosing the largest \(k\) eigenvalues as the projection space. Knossow et al. [17] improve the matching result by performing eigendecomposition on the Laplacian matrix, since it is positive semidefinite. The heat kernel [34] is used to embed the nodes of a graph into a vector space based on the graph-spectral method, and the correspondence matrix between the embedded points of two graphs is computed by a variant of the Scott and Longuet-Higgins algorithm.

The optimization-based method models graph matching as an optimization problem and solves it. The representative algorithms include PATH [36] and GA [37]. In PATH [36], the graph matching problem is formulated as a convex-concave programming problem and is approximately solved. It starts from the convex relaxation and then iteratively solves the convex-concave programming problem by gradually increasing the weight of the concave relaxation, following the path of solutions thus created. GA [37] is a gradient-based approach, which starts from an initial solution and iteratively chooses a matching in the direction of the gradient of the objective function.

Aside from the propagation-/spectral-based methods that compute the similarity score by iterations of random walks or spectral decomposition of adjacency matrix, Jouili and Tabbone [15] propose a vector-based node signature that can be computed straightforwardly from the adjacency matrix. Here, every node is associated with a vector containing its node degree and the incident edge weights. The similarity between two nodes is computed based on their signatures, and the graph matching problem is reduced to a bipartite graph matching problem. A survey can be found in [27].

## 3 Problem statement

We first focus on undirected and unlabeled graphs, since the most difficult part for graph matching is the structural matching without any assistance of labels. We will discuss how to handle labeled graphs later in this paper. For a graph \(G(V,E)\), we use \(V(G)\) to denote the set of nodes and \(E(G)\) to denote the set of edges.

**Definition 1**

Graph/Subgraph Isomorphism.

Graph \(G_1\) is isomorphic to graph \(G_2\), if and only if there exists a bijective function \( f: V(G_1) \rightarrow V(G_2)\) such that for any two nodes \(u_1\in V(G_1)\) and \(u_2\in V(G_1)\), \((u_1,u_2)\in E(G_1)\) if and only if \((f(u_1),f(u_2))\)\(\in E(G_2)\). \(G_1\) is subgraph isomorphic to \(G_2\), if and only if there exists a subgraph \(G^{\prime }\) of \(G_2\) such that \(G_1\) is isomorphic to \(G^{\prime }\).

**Definition 2**

Maximum Common Subgraph. A graph \(G\) is the maximum common subgraph (*MCS*) of two graphs \(G_1\) and \(G_2\), denoted as \(\mathsf{mcs } (G_1,G_2)\), if \(G\) is a common subgraph of \(G_1\) and \(G_2\), and there is no other common subgraph \(G^{\prime }\), such that \(G^{\prime }\) is larger than \(G\).

The *MCS* of two graphs can be disconnected, and there are two kinds of *MCS*s, namely the maximum common node induced subgraph (*MCSv*) and the maximum common edge induced subgraph (*MCSe*). The former requires the *MCS* to be a node induced subgraph of both \(G_1\) and \(G_2\), where \(G^{\prime }\) is larger than \(G\) iff \(|V(G^{\prime })|>|V(G)|\). The latter requires the *MCS* to be an edge induced subgraph of both \(G_1\) and \(G_2\), where \(G^{\prime }\) is larger than \(G\) iff \(|E(G^{\prime })|>|E(G)|\). Figure 1 shows the difference between *MCSv* and *MCSe* for two given graphs \(G_1\) and \(G_2\): Fig. 1a shows the *MCSv* of \(G_1\) and \(G_2\), whereas Fig. 1b shows the *MCSe*. As can be seen from this example, *MCSe* can possibly capture more common substructure of the two given graphs, which is why we adopt *MCSe* in this paper, and we use *MCS* (\(\mathsf{mcs}\)) to denote *MCSe*. Finding the *MCS* of two graphs is NP-hard.

**Definition 3**

Graph Matching. Given two graphs \(G_1\) and \(G_2\), a matching \(M\) between \(G_1\) and \(G_2\) is a set of vertex pairs \(M=\{(u,v)|u\in V(G_1), v\in V(G_2)\}\), such that for any two pairs \((u_1, v_1)\in M\) and \((u_2,v_2)\in M\), \(u_1\ne u_2\) and \(v_1 \ne v_2\). The optimal matching \(M\) of two graphs is the one with the largest number of matched edges. Finding the optimal matching \(M\) is the same as finding the *MCS*.

**Problem Statement:** We aim to compute the optimal matching \(M\) for two given graphs \(G_1\) and \(G_2\). For a given matching \(M\), we evaluate its quality by computing \(score(M)\) as follows.
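The equation defining \(score(M)\) is not reproduced here; whatever its exact normalization, it depends on the number of matched edges, which Definition 3 makes precise and which can be computed as in the following sketch.

```python
def matched_edge_count(M, E1, E2):
    """Number of edges (u1, u2) of G1 such that (M[u1], M[u2]) is an edge
    of G2. M is a dict from V(G1) to V(G2) encoding a one-to-one matching;
    nodes absent from M are unmatched. Edges are undirected."""
    E2set = {frozenset(e) for e in E2}
    return sum(1 for (u1, u2) in E1
               if u1 in M and u2 in M
               and frozenset((M[u1], M[u2])) in E2set)
```

Under Definition 3, the optimal matching is exactly the one maximizing this count.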

## 4 An overview: construction and refinement

Existing exact algorithms use backtracking search to solve the *MCS* problem. Such backtracking search is infeasible for large graphs for two reasons.

The search space is extremely large. For two given graphs with \(n\) and \(n^{\prime }\) nodes, respectively, suppose \(n < n^{\prime }\), the search space of the backtracking method is \(O((n^{\prime }+1)^n)\), which is impractical even for \(n>30\). Although many heuristics can be used to reduce the search space, the complexity of the search space can hardly be reduced.

For each partial solution generated in the backtracking, we need to decide whether we need to expand the current solution further or cut the current branch down. Suppose the current solution can match \(|E_c|\) edges, and the current best solution can match \(|E_b|\) edges. We need to estimate an upper bound \(|E_u|\) of the maximum matching size between the unmatched parts of the two graphs. If \(|E_c| + |E_u|\le |E_b|\), we can cut the current branch down to reduce the search space. A tight upper bound can cut many useless branches and thus largely reduce the search space. However, a good upper bound can hardly be derived. In the literature, as far as we know, the best upper bound is given in [26], which is derived by only considering the degree of each node in the two graphs. Obviously, such an upper bound is very loose because it does not consider any structural information of the two graphs.

It is known that the *MCS* problem is NP-hard, and it is also known to be very difficult to obtain a tight, or even useful, approximation bound, because finding a maximum common subgraph of two graphs is equivalent to finding a maximum clique in their association graph, which cannot be approximated within ratio \(n^\epsilon \) for any constant \(\epsilon > 0\) unless P = NP [3]. For the quality of the *MCS* result, Almohamad and Duffaa [2] give a bound of \(O(n^2)\) on the number of mismatched edges, where \(n\) is the size of the larger graph. This means that it may mismatch all the edges. Raymond et al. [26] provide an upper bound on the size of the *MCS*, which is computed by sorting the degree sequences of the two graphs separately and then summing the corresponding smaller degrees. The bound is almost the size of the smaller graph and does not consider any structural information of the two graphs, so it does not provide much information. For the time complexity, [2] takes \(O(n^6L)\), where \(n\) is the size of the graph and \(L\) is the size of an LP model formulated for graph matching (at least \(n\)); it cannot handle graphs with more than 100 nodes.
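The pruning test \(|E_c| + |E_u|\le |E_b|\), together with a degree-sequence bound in the spirit of [26], can be sketched as follows (the function names are ours, and the bound shown is an illustration of the idea rather than the exact bound of [26]).

```python
def can_prune(matched_edges_current, upper_bound_remaining, best_so_far):
    """Branch-and-bound cut: abandon the branch when even an optimistic
    completion (|E_c| + |E_u|) cannot beat the best matching found so far."""
    return matched_edges_current + upper_bound_remaining <= best_so_far


def degree_upper_bound(deg1, deg2):
    """Degree-sequence bound in the spirit of [26]: sort both degree
    sequences, pair them position by position, sum the smaller degree of
    each pair, and halve (every matched edge has two endpoints)."""
    d1 = sorted(deg1, reverse=True)
    d2 = sorted(deg2, reverse=True)
    paired = sum(min(a, b) for a, b in zip(d1, d2))
    return paired // 2
```

As the text observes, such a bound uses no structural information: two graphs with identical degree sequences but very different topology receive the same, loose bound.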

## 5 Matching construction

In this section, we discuss how to select anchors and how to expand from the selected anchors to obtain the initial matching \(M\) for two graphs \(G_1\) and \(G_2\), using a new node similarity matrix \(S\). The node similarity between \(u \in G_1\) and \(v \in G_2\) is very important because it indicates how likely the two nodes will be matched when computing the matching \(M\).

### 5.1 Global and local node similarity

**Global node similarity:** In the literature, the global similarity for nodes in two graphs can be the spectral-based similarity. The representative study is Umeyama’s work [32], which is improved by [17]. Suppose \(G_1\) and \(G_2\) are two undirected graphs with the same number of nodes \(n\). The Laplacian matrix \(L_{n\times n}\) of a graph \(G\) with \(n\) nodes is defined as \(L=D-A\), where \(A\) is the adjacency matrix and \(D\) is the diagonal degree matrix. \(A[u_1, u_2] = 1\) if \((u_1,u_2) \in E(G)\), and \(0\) otherwise. \(D[u_1, u_1] = \sum _{(u_1, u_2) \in E(G)} A[u_1, u_2]\). We denote the Laplacian matrices of \(G_1\) and \(G_2\) as \(L_1\) and \(L_2\), respectively. Suppose the eigenvalues of \(L_1\) and \(L_2\) are \(\alpha _1 \ge \alpha _2 \ge \cdots \ge \alpha _n\) and \(\beta _1 \ge \beta _2 \ge \cdots \ge \beta _n\), respectively. Since \(L_1\) and \(L_2\) are symmetric and positive-semidefinite, we have \(L_1 =U_1 \Lambda _1 U_1^T\) and \(L_2 =U_2 \Lambda _2 U_2^T\), where \(U_1\) and \(U_2\) are orthogonal matrices, and \(\Lambda _1=diag(\alpha _i)\) and \(\Lambda _2=diag(\beta _i)\). If \(G_1\) and \(G_2\) are isomorphic, there exists a permutation matrix \(P\) such that \(P U_1 \Lambda _1 U_1^T P ^T = U_2 \Lambda _2 U_2^T\). Let \(P =U_2 D^{\prime } U_1^T\), where \(D^{\prime } = diag(d_1,\ldots , d_n)\) and \(d_i \in \{+1,-1\}\) accounts for the sign ambiguity in the eigendecomposition. When \(G_1\) and \(G_2\) are isomorphic, the optimum permutation matrix is the \(P\) that maximizes \(tr(P^T \bar{U}_2 \bar{U}_1 ^T )\), where \(\bar{U}_1\) and \(\bar{U}_2\) are the matrices whose elements are the absolute values of the elements of \(U_1\) and \(U_2\), respectively. When the numbers of nodes in \(G_1\) and \(G_2\) are not the same, we only choose the largest \(c\) eigenvalues [17]. Let \(c=\min \{|V(G_1)|, |V(G_2)|\}\), and let \(\bar{U}_1^{\prime }\) and \(\bar{U}_2^{\prime }\) be the first \(c\) columns of \(\bar{U}_1\) and \(\bar{U}_2\), respectively; the global similarity matrix can then be computed with Eq. (3).
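Assuming Eq. (3) takes the form \(S_g = \bar{U}_1^{\prime } (\bar{U}_2^{\prime })^T\) suggested by the derivation above (an assumption, since the equation itself is not reproduced here), the global similarity matrix can be sketched as follows.

```python
import numpy as np

def global_similarity(A1, A2):
    """Spectral global similarity in the spirit of Umeyama [32] / [17]:
    eigendecompose each Laplacian L = D - A, take elementwise absolute
    values of the eigenvector matrices, keep the columns for the largest
    c = min(n1, n2) eigenvalues, and form S_g = |U1'| |U2'|^T."""
    def abs_eigvecs(A):
        L = np.diag(A.sum(axis=1)) - A      # Laplacian L = D - A
        w, U = np.linalg.eigh(L)            # eigh: ascending eigenvalues
        return np.abs(U[:, ::-1])           # columns by decreasing eigenvalue
    U1, U2 = abs_eigvecs(np.asarray(A1, float)), abs_eigvecs(np.asarray(A2, float))
    c = min(U1.shape[0], U2.shape[0])
    return U1[:, :c] @ U2[:, :c].T          # |V(G1)| x |V(G2)| similarity
```

Note that when eigenvalues repeat, the eigenvectors (and hence \(S_g\)) are not uniquely determined; the absolute values only resolve the sign ambiguity discussed in the text.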

Example 1 shows an example of matching two graphs using the global node similarity.

*Example 1*

**Alternative global node similarity measures:** Besides the global node similarity based on eigendecomposition, there are other global similarity measures in the literature based on node importance in the graph, such as the Katz score [9, 14] and random walk with restart (RWR) [30]. A Katz score is a weighted count of the number of walks originating (or terminating) at a given node. The walks are weighted inversely by their length, so that long and highly indirect walks count less, while short and direct walks count more. The Katz score is given by the formula \(\varvec{r}=(I- bA)^{-1}bA \varvec{u}\), where \(\varvec{r}\) is the \(N \times 1\) column vector containing the Katz score for each node, \(I\) is the \(N \times N\) identity matrix, \(\varvec{u}\) is an \(N \times 1\) column vector with all entries equal to 1, and \(b \in (0,1)\) is the attenuation factor, which is \(1/(d+1)\) by default in [14], where \(d\) is the maximum degree of the graph. The extent to which the weights attenuate with length is controlled by \(b\). The RWR score is given by the formula \(\varvec{r}=(1-c)(I- cW)^{-1}\varvec{u}\), where \(W\) is the transition matrix with \(W(i,j) = A(i,j)/ \sum _i A(i,j)\), and \(c \in [0,1]\) controls the restart: a surfer at a node jumps to a random node with probability \(1-c\). Under this random walk, the importance of a node \(v\) is the expected sum of the importance of all the nodes \(u\) that link to \(v\). Two nodes, \(u\) in graph \(G_1\) and \(v\) in graph \(G_2\), are considered highly similar if both have a high Katz/RWR score. The similarity matrix of the two graphs becomes \(S_g^{\prime } = \varvec{r_1} \varvec{r_2}^T\). In Sect. 8, we report the effectiveness of these global node similarity measures.
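Both importance vectors follow directly from the formulas above; the sketch below uses the default \(b = 1/(d+1)\) for Katz and treats \(c\) as a free parameter for RWR.

```python
import numpy as np

def katz_scores(A, b=None):
    """Katz score r = (I - bA)^{-1} bA u with u the all-ones vector;
    by default b = 1/(d+1), d being the maximum degree, as in [14]."""
    A = np.asarray(A, float)
    n = A.shape[0]
    if b is None:
        b = 1.0 / (A.sum(axis=1).max() + 1)
    u = np.ones(n)
    return np.linalg.solve(np.eye(n) - b * A, b * (A @ u))

def rwr_scores(A, c=0.9):
    """Random walk with restart: r = (1-c)(I - cW)^{-1} u, where W is the
    column-normalized transition matrix W[i, j] = A[i, j] / sum_i A[i, j]."""
    A = np.asarray(A, float)
    n = A.shape[0]
    W = A / A.sum(axis=0, keepdims=True)
    return (1 - c) * np.linalg.solve(np.eye(n) - c * W, np.ones(n))
```

Given the two importance vectors, the cross-graph similarity of the text is simply the outer product, e.g. `np.outer(katz_scores(A1), katz_scores(A2))` for \(S_g^{\prime } = \varvec{r_1} \varvec{r_2}^T\).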

The global node similarity gives a node similarity measure from the global point of view. However, when \(G_1\) and \(G_2\) are not sufficiently similar to each other, using global node similarity only is not sufficient to get a good matching because the global node similarity does not consider the local information for nodes in two graphs. We need a local node similarity.

**Local node similarity:** For any node \(v\) in graph \(G\) and \(k\ge 0\), we define the \(k\)-neighborhood of \(v\), \(N_k(v)\), as the set of nodes in \(V(G)\) such that \(v\notin N_k(v)\) and, for any \(u \in N_k(v)\), the shortest distance from \(v\) to \(u\) is no more than \(k\). The shortest distance is defined as the number of edges in the shortest path from \(v\) to \(u\). The \(k\)-neighborhood subgraph of \(v\) in \(G\), denoted as \(G_v^k\), is defined as the induced subgraph over \(N_k(v)\cup \{v\}\) in \(G\). For two nodes \(u\in V(G_1)\) and \(v\in V(G_2)\), we measure their local node similarity by comparing their \(k\)-neighborhood subgraphs. Suppose \(d(u)\) and \(d(v)\) are the degrees of nodes \(u\) and \(v\) in \(G_1\) and \(G_2\), respectively, and suppose \(d_{1,1}, d_{1,2}, \ldots \) is the degree sequence of the node set \(N_k(u)\) in \(G_u^k\) sorted in non-increasing order, and \(d_{2,1}, d_{2,2}, \ldots \) is the degree sequence of the node set \(N_k(v)\) in \(G_v^k\) sorted in non-increasing order. Let \(n_{min}=\min \{|N_k(u)|,|N_k(v)|\}\). We define a \(|V(G_1)|\times |V(G_2)|\) local node similarity matrix \(S_l\) as follows.

- (1)
\(0 < S_l[u,v] \le 1\).

- (2)
\(S_l[u,v] \ge \dfrac{(|V(\mathsf{mcs } (G_u^k,G_v^k))|\!+\!|E(\mathsf{mcs } (G_u^k,G_v^k))|)^2}{(|V(G_u^k)|\!+\!|E(G_u^k)|)(|V(G_v^k)|\!+\!|E(G_v^k)|)}\).

- (3)
If \(G_u^k\) and \(G_v^k\) are isomorphic, and \(u\) matches \(v\) in the optimal matching of \(G_u^k\) and \(G_v^k\), then \(S_l[u,v]=1\).

- (4)
If \(G_u^k\) is subgraph isomorphic to \(G_v^k\), and \(u\) matches \(v\) in the optimal matching of \(G_u^k\) and \(G_v^k\), we have \(S_l[u,v]=\dfrac{|V(G_u^k)|+|E(G_u^k)|}{|V(G_v^k)|+|E(G_v^k)|}\).

Properties (1) and (2) follow from the definition of \(S_l\) and the *MCS*. For (3), this can be obtained based on the illustration of the first property, since when the two subgraphs are isomorphic, we have \(n_{min}+1 = |V(G_u^k)| =|V(G_v^k)|\) and \( D(u,v) =|E(G_u^k)| = |E(G_v^k)|\). For (4), it is because when \(G_u^k\) is subgraph isomorphic to \(G_v^k\), we have \(n_{min}+1 = |V(G_u^k)|\) and \(D(u,v)= \dfrac{d(u) + \sum _{i=1}^{n_{min}}d_{1,i}}{2} = |E(G_u^k)|\), which leads to \(S_l[u,v] = \dfrac{|V(G_u^k)|+|E(G_u^k)|}{|V(G_v^k)|+|E(G_v^k)|}\).

Note that our local similarity [Eq. (4)] is different from the vector-based node signature [15], which deals with edge weights. For an undirected and unweighted graph, the edge weights of all incident edges are 1, so the node signature in [15] reduces to the node degree, and measuring the similarity of two nodes by their degrees alone is not sufficient: many pairs of nodes share the same degree but have different structures. In our local similarity measure, we consider not only the degrees of the two nodes but also their \(k\)-neighborhoods. The measure of [15] is a special case of our local similarity with \(k = 0\) for undirected and unweighted graphs.
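Eq. (4) is not reproduced above. The sketch below uses a form inferred from properties (1)–(4) and the proof fragments, namely \(S_l[u,v] = (n_{min} + 1 + D(u,v))^2 / ((|V(G_u^k)|+|E(G_u^k)|)(|V(G_v^k)|+|E(G_v^k)|))\), where \(D(u,v)\) pairs the sorted degree sequences of the two \(k\)-neighborhoods; this inferred formula is an assumption, not the paper's verbatim definition.

```python
from collections import deque

def k_neighborhood(adj, v, k):
    """Nodes within shortest-path distance k of v (excluding v), via BFS.
    adj maps each node to the set of its neighbors."""
    dist = {v: 0}
    q = deque([v])
    while q:
        x = q.popleft()
        if dist[x] == k:
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    return set(dist) - {v}

def local_similarity(adj1, u, adj2, v, k=2):
    """Inferred form of Eq. (4): compare the k-neighborhood subgraphs of u
    and v through their sorted degree sequences. Returns a value in (0, 1],
    equal to 1 when the subgraphs are isomorphic with u matched to v."""
    Nu, Nv = k_neighborhood(adj1, u, k), k_neighborhood(adj2, v, k)
    Vu, Vv = Nu | {u}, Nv | {v}
    def induced_degs(adj, nodes):
        # degrees inside the induced k-neighborhood subgraph
        return {x: sum(1 for y in adj[x] if y in nodes) for x in nodes}
    du, dv = induced_degs(adj1, Vu), induced_degs(adj2, Vv)
    Eu, Ev = sum(du.values()) // 2, sum(dv.values()) // 2
    s1 = sorted((du[x] for x in Nu), reverse=True)
    s2 = sorted((dv[x] for x in Nv), reverse=True)
    n_min = min(len(Nu), len(Nv))
    D = (min(du[u], dv[v]) + sum(min(a, b) for a, b in zip(s1, s2))) / 2.0
    return (n_min + 1 + D) ** 2 / ((len(Vu) + Eu) * (len(Vv) + Ev))
```

One can check that this form satisfies properties (3) and (4): for isomorphic neighborhoods the ratio is exactly 1, and in the subgraph-isomorphic case it reduces to \((|V(G_u^k)|+|E(G_u^k)|)/(|V(G_v^k)|+|E(G_v^k)|)\).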

*Example 2*

Reconsider the two graphs in Fig. 2. Let \(k = 2\). The similarity matrix \(S\) of \(G_{1}\) and \(G_{2}\) is shown in Fig. 3b. We construct a bipartite graph \(G_b\) with \(|V(G_1)|+|V(G_2)|\) nodes, and for any \(u\in V(G_1)\) and \(v\in V(G_2)\), we add an edge \((u,v)\in E(G_b)\) with weight \(S[u,v]\) (instead of \(S_g[u, v]\)). We compute the maximum weighted bipartite matching of \(G_b\) and get the matching \(M=\{(u_1, v_1), (u_2, v_{2}), (u_3, v_{3}), (u_4, v_4), (u_5, v_5), (u_6, v_{12}), (u_7, v_{13}), (u_8, v_8), (u_9, v_{17}), (u_{10}, v_{10}), (u_{11}, v_{14}), (u_{12}, v_{6}), (u_{13}, v_{11}), (u_{14}, v_{15}), (u_{15}, v_{9}), (u_{16}, v_{16})\}\). The number of matched edges is \(13\), which is better than \(10\) when only using the global similarity. But it is still much less than the optimal solution, \(21\).

**A problematic approach to compute \(M\) using \(S\):** Umeyama [32] computes a matching \(M\) by applying the Hungarian algorithm to a node similarity matrix, which can be our newly proposed \(S\) or the \(S_g\) given in [32]. Using all the similar node pairs computed, a matching \(M\) can be found. In order to compute a matching, Umeyama constructs a bipartite graph \(G_b\) that includes \(|V(G_1)|+|V(G_2)|\) nodes. For any node \(u\in V(G_1)\) and node \(v\in V(G_2)\), an edge \((u,v)\) is added to \(G_b\) with weight \(S[u,v]\) (or \(S_g[u,v]\)). The maximum weighted bipartite matching of \(G_b\) leads to a matching \(M\) of graphs \(G_1\) and \(G_2\). Such an approach has two drawbacks.

Similarity optimality does not mean matched edge optimality, while our aim is to maximize the number of matched edges in two graphs. It is possible that two nodes are very similar in terms of \(S\) (or \(S_g\)) but do not have many incident edges that help to increase the number of matched edges. As an example, suppose node \(u_1\in V(G_1)\) and node \(v_1 \in V(G_2)\) both have degree \(1\) with \(S[u_1,v_1]=1.0\), and node \(u_2 \in V(G_1)\) and node \(v_2\in V(G_2)\) both have degree \(10\) with \(S[u_2,v_2]=0.9\). Suppose \((u_1,v_1)\) is in conflict with \((u_2,v_2)\) when computing the maximum weighted bipartite matching. In constructing the initial matching, the algorithm may give up \((u_2,v_2)\) because it has a lower similarity. But obviously, giving up \((u_1,v_1)\) is the better choice, because \((u_2,v_2)\) can contribute a larger number of matched edges, even though \(u_2\) and \(v_2\) have lower node similarity.

This approach only considers the matching of individual nodes in two graphs, and does not consider whether the nodes around them can be well matched when it matches two nodes. In other words, matching \(u\in V(G_1)\) with \(v\in V(G_2)\) does not consider whether the nodes around \(u\) and \(v\) can be matched using the maximum weighted bipartite matching. When the nodes around \(u\) and \(v\) are mismatched, even if \(u\) and \(v\) are similar, it can significantly affect the quality of the final matching \(M\).
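The baseline above reduces matching to maximum weighted bipartite matching over the similarity matrix. A brute-force sketch of that reduction follows (in practice the Hungarian algorithm solves it in \(O(n^3)\); the brute force here keeps the example dependency-free and is only feasible for tiny matrices).

```python
from itertools import permutations

def best_bipartite_matching(S):
    """Maximum weighted bipartite matching over a small similarity matrix S
    (rows: V(G1), columns: V(G2), len(S) <= len(S[0])), by trying every
    injective assignment. Returns the list of (row, column) pairs."""
    n, m = len(S), len(S[0])
    best, best_pairs = float("-inf"), []
    for perm in permutations(range(m), n):
        w = sum(S[i][perm[i]] for i in range(n))
        if w > best:
            best, best_pairs = w, list(enumerate(perm))
    return best_pairs
```

The degree example above shows exactly why this objective is the wrong one: the assignment maximizes total similarity, not the number of matched edges.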

### 5.2 Anchor selection and expansion

- (1) \(\min \{d(u),d(v)\}\ge \delta \), where \(\delta \) is the larger average degree of the two graphs, that is,$$\begin{aligned} \delta = \max \left\{ \frac{2\times |E(G_1)|}{|V(G_1)|}, \frac{2\times |E(G_2)|}{|V(G_2)|}\right\} . \end{aligned}$$
- (2) \(S[u,v]\ge \tau \), where \(\tau \) is a threshold and generally \(\tau > 0.5\); it is a sensitive threshold that has an impact on graph matching, which we discuss in Sect. 5.3.
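Algorithm 2 itself is not reproduced in the text. The sketch below applies conditions (1) and (2) and then greedily keeps node-disjoint pairs in decreasing similarity order; the exact scanning and tie-breaking of Algorithm 2 are assumptions.

```python
def select_anchors(S, deg1, deg2, num_e1, num_e2, tau=0.5):
    """Anchor selection sketch per conditions (1) and (2): a pair (u, v)
    qualifies if min(deg(u), deg(v)) >= delta (the larger average degree)
    and S[u][v] >= tau; qualifying pairs are scanned in decreasing
    similarity, greedily keeping node-disjoint pairs."""
    n1, n2 = len(deg1), len(deg2)
    delta = max(2.0 * num_e1 / n1, 2.0 * num_e2 / n2)
    cand = sorted(((S[u][v], u, v)
                   for u in range(n1) for v in range(n2)
                   if min(deg1[u], deg2[v]) >= delta and S[u][v] >= tau),
                  reverse=True)
    used1, used2, anchors = set(), set(), []
    for _, u, v in cand:
        if u not in used1 and v not in used2:
            anchors.append((u, v))
            used1.add(u)
            used2.add(v)
    return anchors
```

The degree condition filters out pairs like \((u_9, v_{17})\) in Example 3, which are similar but poorly connected and therefore bad seeds for expansion.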

*Example 3*

Consider the two graphs in Fig. 2. Suppose \(\tau =0.94\); using Algorithm 2, we get the set of anchor pairs \(\mathcal A =\{(u_1,v_1),(u_8,v_8)\}\). Obviously, the correct matching of these two pairs is very important in the final matching of \(G_1\) and \(G_2\). The pair \((u_9,v_{17})\), although it satisfies the similarity constraint, violates the degree constraint. Obviously, expanding from the pair \((u_9, v_{17})\) to match other pairs would be a bad choice.

**Theorem 1**

The time complexity of Algorithm 2 is \(O(|V(G_1)|^2 \cdot (|V(G_1)| + |E(G_1)|) + |V(G_2)|^2 \cdot (|V(G_2)| + |E(G_2)|))\).

*Proof 1*

Algorithm 2 is to select anchors. Computing the global node similarity matrix needs \(O(|V(G_{1})|^{3} + |V(G_{2})|^{3})\) time, and computing the local node similarity matrix needs \(O(|V(G_1)|^2 \cdot |E(G_1)| + |V(G_2)|^2 \cdot |E(G_2)|)\) time. In lines 3-5, sorting all pairs needs \(O(|V(G_1)| \cdot |V(G_2)| \cdot (\log (|V(G_1)|) + \log (|V(G_2)|)))\) time. Hence, the overall time complexity of Algorithm 2 is \(O( |V(G_1)|^2 \cdot (|V(G_1)| + |E(G_1)|) + |V(G_2)|^2 \cdot (|V(G_2)| + |E(G_2)|))\). \(\square \)

We illustrate the anchor expansion algorithm (Algorithm 3) used to obtain a matching \(M\). Let \(\mathcal A \) be the set of anchor pairs \((u,v)\) already selected. Initially, \(M = \mathcal{A}\). Let \(N(u)\) and \(N(v)\) denote the immediate neighbors of \(u\) and \(v\) in graphs \(G_1\) and \(G_2\), respectively. For every matched pair \((u,v)\) in the initial \(M\), we put all \(N(u) \times N(v)\) pairs into a queue \(\mathcal Q \), where \(\mathcal Q \) is the set of candidate matching pairs sorted in decreasing order of their local similarity. In an iterative manner, we remove the pair \((u,v)\) with the largest local similarity \(S_l[u,v]\) [Eq. (4)] from \(\mathcal Q \). If neither \(u\) nor \(v\) has been matched before, we add \((u,v)\) to \(M\) and put all their \(N(u) \times N(v)\) immediate neighbor pairs into \(\mathcal Q \) for further consideration. We repeat this until \(\mathcal Q = \emptyset \).
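The expansion loop just described can be sketched with a max-heap for \(\mathcal Q \); here `sim(u, v)` stands in for \(S_l[u,v]\) of Eq. (4), and the pseudocode of Algorithm 3 is not reproduced, so details such as duplicate handling are our assumptions.

```python
import heapq

def anchor_expansion(anchors, adj1, adj2, sim):
    """Sketch of Algorithm 3: grow the matching M outward from the anchors,
    always committing the unmatched candidate neighbor pair with the
    largest local similarity first. adj1/adj2 map nodes to neighbor sets."""
    M = dict(anchors)                 # V(G1) -> V(G2)
    used2 = set(M.values())
    heap = []                         # max-heap via negated similarity

    def push_neighbors(u, v):
        for nu in adj1[u]:
            for nv in adj2[v]:
                heapq.heappush(heap, (-sim(nu, nv), nu, nv))

    for u, v in anchors:
        push_neighbors(u, v)
    while heap:
        _, u, v = heapq.heappop(heap)
        if u in M or v in used2:
            continue                  # one of the two is already matched
        M[u] = v
        used2.add(v)
        push_neighbors(u, v)          # their neighbors become candidates
    return M
```

Because candidates are only generated around already-matched pairs, the expansion stays local to the anchors, which is exactly why anchor quality (Sect. 5.3) matters so much.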

The example of anchor expansion is given below.

*Example 4*

Consider the two graphs in Fig. 2 again. After obtaining the set of anchor pairs \(\mathcal A = \{(u_1,v_1),(u_8,v_8)\}\), we can use Algorithm 3 to construct the matching \(M=\{(u_1,v_1), (u_2,v_7), (u_3,v_3), (u_4,v_4), (u_5,v_5), (u_6,v_6), (u_7,v_2), (u_8,v_8), (u_9,v_9), (u_{10},v_{10}), (u_{11},v_{13}), (u_{12}, v_{12}), (u_{13},v_{11}), (u_{14},v_{14}), (u_{15},v_{15}), (u_{16},v_{16})\}\). The number of matched edges is 18.

**Theorem 2**

The time complexity of Algorithm 3 is \(O(|V(G_1)| \cdot |V(G_2)| \cdot \min \{|V(G_1)|, |V(G_2)|\})\).

*Proof 2*

Algorithm 3 expands from the selected anchors. The dominant part of the time complexity is lines 5 and 7. For line 5, there are at most \(|V(G_1)|\cdot |V(G_2)|\) pairs, and for each pair, it needs \(O(|V(G_1)|\cdot |V(G_2)|)\) time to obtain the one with the largest similarity. For line 7, there are at most \(\min \{|V(G_1)|,|V(G_2)|\}\) matched pairs, and for each pair, it needs \(O(|V(G_1)|\cdot |V(G_2)|)\) time to compute the cartesian product. Therefore, the overall time complexity is \(O(|V(G_1)|\cdot |V(G_2)| \cdot \min \{|V(G_1)|,|V(G_2)|\})\). \(\square \)

### 5.3 Discussion on \(\tau \) for anchor selection

In the matching construction step, the threshold \(\tau \) used in \(\mathsf anchor \)-\(\mathsf selection \) (Algorithm 2) is an important factor for the matching quality. It should be neither too large nor too small. When \(\tau \) is too large, very few nodes are selected as anchors, which leads to more nodes being mismatched in \(\mathsf anchor \)-\(\mathsf expansion \). The reason is that \(\mathsf anchor \)-\(\mathsf expansion \) is a greedy algorithm and can only achieve a local optimum. For a node in a graph, the more steps it needs to be expanded from an anchor, the higher its probability of being mismatched. When \(\tau \) is too small, a large number of anchor pairs may be selected, and many mismatched anchors are thus involved. Expanding from these mismatched anchors will hardly lead to a good matching result. We explain this using an example.

*Example 5*

Reconsider Example 3 in Sect. 5. Suppose we set \(\tau \) to a very small value, say \(\tau =0.78\), for the graphs in Fig. 2. A large set of anchor pairs is obtained: \(\mathcal A = \{(u_1, v_1), (u_4, v_4), (u_5, v_5), (u_6, v_{12}), (u_7, v_{13}), (u_8, v_8), (u_{10}, v_{10}), (u_{12}, v_6)\}\). If we then run \(\mathsf anchor \)-\(\mathsf expansion \) based on this \(\mathcal A \), we obtain the matching result \(M = \{(u_1,v_1), (u_2,v_2), (u_3,v_3), (u_4,v_4), (u_5,v_5), (u_7,v_{13}), (u_8,v_8), (u_9,v_{17}), (u_{10},v_{10}), (u_{12},v_6), (u_{13},v_{11}), (u_{14},v_{14}), (u_{15},v_{16}), (u_{16},v_{15})\}\). The number of matched edges is 16, which is not as good as the matching result in Example 4 (18 matched edges). This is because a small \(\tau \) puts several mismatched anchor pairs \(\{(u_6, v_{12}), (u_7, v_{13}), (u_{12}, v_6)\}\) into \(\mathcal A \). Expanding from such mismatched anchor pairs makes the matching result ineffective. On the other hand, suppose we set \(\tau \) to a very high value, say \(\tau =0.98\). No anchor pair will be selected. In that case, one node is randomly selected from each graph to form an anchor pair, and expanding from such a random anchor pair is unlikely to lead to a good matching.

**Theorem 3**

The time complexity of Algorithm 4 is \(O(|V(G_1)|^2 \cdot (|V(G_1)| + |E(G_1)|) + |V(G_2)|^2 \cdot (|V(G_2)| + |E(G_2)|))\).

*Proof 3*

The dominant part of Algorithm 4 is line 2 and lines 3–7. For line 2, the time complexity of \(\mathsf anchor \)-\(\mathsf selection \) is shown to be \(O(|V(G_1)|^2 \cdot (|V(G_1)| + |E(G_1)|) + |V(G_2)|^2 \cdot (|V(G_2)| + |E(G_2)|))\) in Theorem 1. For lines 3–7, since the time complexity of \(\mathsf anchor \)-\(\mathsf expansion \) is shown to be \(O(|V(G_1)| \cdot |V(G_2)| \cdot \min \{|V(G_1)|, |V(G_2)|\})\) in Theorem 2, the total complexity of lines 3–7 is \(O(c\cdot |V(G_1)| \cdot |V(G_2)| \cdot \min \{|V(G_1)|, |V(G_2)|\})\), where \(c\) is the number of iterations of the loop. Usually, \(c = 0.5/l\) is a small constant for a given \(l\). So the total complexity of Algorithm 4 is \(O(|V(G_1)|^2 \cdot (|V(G_1)| + |E(G_1)|) + |V(G_2)|^2 \cdot (|V(G_2)| + |E(G_2)|))\). \(\square \)

The time complexity of Algorithm 4 remains unchanged compared to Algorithms 2 and 3, because it only repeats \(\mathsf anchor \)-\(\mathsf expansion \) a constant number of times. It is worth noting that \(\mathsf anchor \)-\(\mathsf selection \) is the dominant factor; \(\mathsf anchor \)-\(\mathsf expansion \) can be done very quickly in practice compared to \(\mathsf anchor \)-\(\mathsf selection \). Some results are shown in Table 3 in Sect. 8, where the time of \(\mathsf anchor \)-\(\mathsf expansion \) means the total expansion time including the \(\tau \) selection.

## 6 Matching refinement

The initial matching \(M\) is computed using heuristics that match the anchors first and then match the nodes around the anchors in a top-down fashion. These heuristics cannot guarantee that all the anchors are correctly matched. In this section, we propose a new approach to refine the initial matching \(M\). It is important to note that our strategy is to refine the initial matching, not to find a completely new matching. By refinement, we mean the following two things. First, we do not blindly explore all possibilities when we refine a matching; instead, we refine a matching \(M\) toward a better one that is most likely to exist and can be identified. Second, we consider efficiency when refining a matching. In our approach, each time we focus on a subset of nodes to refine by excluding a subset of nodes and including a subset of nodes. The set of nodes excluded from refinement at one time is neither too large nor too small. Also, we give every node in the graphs a chance to be refined.

### 6.1 Vertex cover based refinement

We use a vertex cover \(C\) to refine a matching \(M\). A vertex cover \(C\) of a graph \(G\) is a subset of nodes in \(V(G)\), that is, \(C\subseteq V(G)\), such that for every edge \((u,v)\in E(G)\), we have \(u\in C\) or \(v\in C\). A minimum vertex cover of graph \(G\) is a vertex cover with the minimum number of nodes. A vertex cover \(C\) of \(G\) is a minimal vertex cover, if there does not exist a vertex cover \(C^{\prime }\) of \(G\) such that \(C^{\prime }\subset C\). A set of nodes \(C\) is a vertex cover of graph \(G\) if and only if its complement \(I=V(G)-C\) is an independent set of \(G\). Here, an independent set \(I\) of \(G\) is a subset of nodes in \(V(G)\), that is, \(I\subseteq V(G)\), such that for any \(u\in I\) and \(v\in I\), \((u,v)\notin E(G)\).
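The definitions above, including the duality between a vertex cover and its complement independent set, can be checked directly. A minimal Python sketch (the function names are ours, for illustration):

```python
def is_vertex_cover(C, edges):
    """C is a vertex cover iff every edge has at least one endpoint in C."""
    return all(u in C or v in C for u, v in edges)

def is_independent_set(I, edges):
    """I is an independent set iff no edge has both endpoints in I."""
    return all(not (u in I and v in I) for u, v in edges)

def check_duality(V, C, edges):
    """C is a vertex cover of G iff V - C is an independent set of G."""
    return is_vertex_cover(C, edges) == is_independent_set(V - C, edges)
```

For example, in a triangle \(\{1,2,3\}\) with a pendant edge \((3,4)\), the set \(\{1,3\}\) is a vertex cover and its complement \(\{2,4\}\) is independent.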

The vertex cover structure plays an important role when we match two graphs \(G_1\) and \(G_2\). It allows us to focus on one graph \(G_1\), with the assistance of its vertex cover. The intuition is as follows. By definition, a vertex cover of \(G_1\) is the set of nodes that covers all possible edges in \(G_1\). This implies that a node in the vertex cover can possibly have many edges to cover (or possibly have many matched edges with another graph \(G_2\)).

A vertex cover \(C\) of \(G_1\) divides \(V(G_1)\) into three parts, \(F_1 = C \cap P_1\), \(C - F_1\) and \(V(G_1) - C\). The implications are given below. The nodes in \(F_1\) are most likely to lead to good matches, based on the definition of vertex cover, so we exclude the nodes in \(F_1\) from refinement. We include the nodes in \(V(G_1) - C\) in refinement, because the complement of the vertex cover, \(V(G_1) - C\), is an independent set. This property makes it possible to apply efficient polynomial algorithms for optimizing the matching. For \(C - F_1\), we first discuss how to refine by excluding the nodes in \(C - F_1\), and then discuss how to include the nodes in \(C - F_1\) in refinement.

### 6.2 Refinement and its optimality

Given two graphs \(G_1\) and \(G_2\), a matching \(M\), and a vertex cover \(C\) of \(G_1\), we give a refinement \(M^+(C)\) of \(M\), and show its optimality below.

*Example 6*

Second, we give the optimality of \(M^+(C)\) over a matching space \(\mathcal M \). The space \(\mathcal M \) is a set of matchings between nodes in \(G_1\) and \(G_2\), such that for any matching \(M^{\prime }\), \(M^{\prime }\in \mathcal M \) if and only if \(M^{\prime }\cap (F_1\times F_2) = M\cap (F_1 \times F_2)\) and \(M^{\prime }\cap ((C-F_1)\times V(G_2)) = \emptyset \). In other words, a matching \(M^{\prime }\) is in \(\mathcal M \) if and only if the matching for nodes in \(F_1\) is unchanged and the matching for nodes in \(C-F_1\) is empty. The second condition can also be expressed as \(M^{\prime }[C-F_1]=\emptyset \).

**Theorem 4**

- (1)
\(|\mathcal M | = \)\(\sum _{i=0}^{min} \dfrac{min!\times max!}{i!\times (min-i)!\times (max-i)!}\) and \(\dfrac{max!}{(max-min)!} \le |\mathcal M | \le (max+1)^{min}\),

- (2)
\(M\in \mathcal M \),

- (3)
\(M^+(C) \in \mathcal M \) and

- (4)
\(M^+(C)\) is optimal in \(\mathcal M \).

*Proof 4*

- (1) To make things simple and without loss of generality, we assume \(|V(G_1)| - |C| \le |V(G_2)| - |F_2|\); then \(min = |V(G_1)| - |C|\) and \(max = |V(G_2)| - |F_2|\). Since \(V(G_1) - C\) and \(V(G_2) - F_2\) are the included parts of \(G_1\) and \(G_2\), respectively, we only consider the number of different matchings between \(V(G_1)-C\) and \(V(G_2) - F_2\). Suppose that \(i\) nodes in \(V(G_1) - C\) participate in the matching in \(\mathcal M \); there are \(C_{min}^i\) different selections of the \(i\) nodes, and for each selection, there are \(P_{max}^i\) different matchings between the \(i\) nodes and the nodes in \(V(G_2) - F_2\). There are in total \(C_{min}^i \times P_{max}^i\) different matchings for a given \(i\). Since \(i \in [0, min]\), the total number of different matchings is$$\begin{aligned} |\mathcal M |&= \sum _{i=0}^{min}C_{min}^i \times P_{max}^i \\&= \sum _{i=0}^{min} \frac{min! \times max!}{i! \times (min-i)! \times (max-i)!} \end{aligned}$$When \(i = min\), we have:$$\begin{aligned} |\mathcal M | \ge C_{min}^i \times P_{max}^i = \frac{max!}{(max - min)!} \end{aligned}$$If we remove the constraint that different nodes in \(V(G_1) - C\) must match different nodes in \(V(G_2) - F_2\), each node in \(V(G_1) - C\) has \(max + 1\) choices: the \(max\) nodes in \(V(G_2) - F_2\) and an empty match. The number of different relaxed matchings is then \((max+1)^{min}\), which is an upper bound of \(|\mathcal{M}|\).
- (2)
We only need to prove that \(M\) satisfies the two conditions of \(\mathcal M \). For the first condition, obviously, \(M\cap (F_1\times F_2) = M\cap (F_1 \times F_2)\). For the second condition, the part \(C-F_1\) is the nodes in \(C\) that are not matched in \(M\), so \(M[C-F_1]=\emptyset \). As a result, \(M\cap ((C-F_1)\times V(G_2)) = \emptyset \).

- (3) We need to show that \(M^+(C)\) satisfies the two conditions of \(\mathcal M \).
- For the first condition, we have:$$\begin{aligned}&M^+(C) \cap (F_1 \times F_2)\\&\quad =((M \cap (F_1 \times F_2)) \cup M_b)\cap (F_1 \times F_2) \\&\quad =(M \cap (F_1 \times F_2)) \cup (M_b \cap (F_1 \times F_2)) \end{aligned}$$Since \(M_b\) only includes nodes in \(V(G_1)-C\) and \(V(G_2)-F_2\), we have \(M_b\cap (F_1\times F_2)=\emptyset \). As a result, \(M^+(C)\cap (F_1\times F_2)=M\cap (F_1\times F_2)\).
- For the second condition, we have:$$\begin{aligned}&M^+(C) \cap ((C-F_1)\times V(G_2))\\&=((M\cap (F_1\times F_2)) \cup M_b) \cap ((C-F_1)\times V(G_2))\\&=((M\cap (F_1\times F_2))\cap ((C-F_1)\times V(G_2))) \\&\cup (M_b \cap ((C-F_1)\times V(G_2))) \end{aligned}$$Moreover, we have \((M\cap (F_1 \times F_2))\cap ((C-F_1)\times V(G_2))=\emptyset \), because \(M\cap ((C-F_1) \times V(G_2)) = \emptyset \) is already proved in (2), and \(M_b \cap ((C-F_1) \times V(G_2)) = \emptyset \) due to the fact that \(M_b\) does not contain any nodes in \(C-F_1\). Thus, we have \(M^+(C)\cap ((C-F_1)\times V(G_2)) = \emptyset \).

- (4)For any matching \(M^{\prime }\in \mathcal M \), we define a matching \(M_b^{\prime }\) as \(M_b^{\prime } = M^{\prime } \cap ((V(G_1)-C) \times (V(G_2) - F_2))\). We use \(score_b(M_b)\) to denote the total weight for the bipartite matching \(M_b\) of the bipartite graph \(G_b\). We claim: (a) \(M_b^{\prime }\) is a bipartite matching of \(G_b\); (b) \(score(M^{\prime }) = score_b(M_b^{\prime }) + score(M\cap (F_1 \times F_2))\); (c) \(score(M^+(C)) = score_b(M_b) + score(M\cap (F_1 \times F_2))\).
For (a), it is obvious for two reasons: (1) \(M_b^{\prime }\) only contains the nodes in \(V(G_1)-C\) and \(V(G_2)-F_2\), which is exactly the set of nodes in \(G_b\); (2) any edge in \(M_b^{\prime }\) is also an edge of \(G_b\) since \(G_b\) is a complete bipartite graph.

- For (b), we have:$$\begin{aligned}&score(M^{\prime })\\&\quad =score(M^{\prime }\cap ((C\cup (V(G_1)-C)) \\&\quad \quad \times (F_2\cup (V(G_2)-F_2))))\\&\quad =score((M^{\prime }\cap (C \times F_2))\\&\quad \quad \cup (M^{\prime } \cap (C \times (V(G_2) -F_2)))\\&\quad \quad \cup (M^{\prime }\cap ((V(G_1)-C)\times F_2))\\&\quad \quad \cup (M^{\prime }\cap ((V(G_1)-C)\times (V(G_2)-F_2)))) \end{aligned}$$Since \(C\times F_2\), \(C\times (V(G_2)-F_2)\), \((V(G_1)-C)\times F_2\) and \((V(G_1)-C)\times (V(G_2)-F_2)\) are mutually exclusive with each other, we have:$$\begin{aligned}&\!\!score(M^{\prime })\\&=score(M^{\prime }\cap (C\times F_2)) \\&\quad +score(M^{\prime }\cap (C\times (V(G_2)-F_2))) \\&\quad +score(M^{\prime }\cap ((V(G_1)-C)\times F_2)) \\&\quad +score(M^{\prime }\cap ((V(G_1)-C)\times (V(G_2)-F_2))) \end{aligned}$$Since \(M^{\prime }[F_1]=F_2\) and \(M^{\prime }[C-F_1]=\emptyset \), we have \(M^{\prime }[C]=M^{\prime }[F_1]\cup M^{\prime }[C-F_1]=F_2\), and thus \(M^{\prime }\cap (C\times (V(G_2)-F_2))=\emptyset \). Since \(M^{\prime -1}[F_2]=F_1\) and \(F_1\subseteq C\), we have \(M^{\prime }\cap ((V(G_1)-C)\times F_2)=\emptyset \). We also have:$$\begin{aligned}&M^{\prime }\cap (C\times F_2)\\&= M^{\prime }\cap (((C-F_1)\cup F_1)\times F_2)\\&= (M^{\prime }\cap ((C-F_1)\times F_2)) \cup (M^{\prime }\cap (F_1\times F_2)) \\&= M^{\prime }\cap (F_1\times F_2) \end{aligned}$$The last equation is due to \(M^{\prime }\cap ((C-F_1)\times F_2) = \emptyset \) because \(M^{\prime }[C-F_1]=\emptyset \). Since \(M^{\prime }\cap (F_1\times F_2) = M \cap (F_1 \times F_2)\), we can derive:$$\begin{aligned}&\!\!\!score(M^{\prime })\\&=score(M^{\prime }\cap (F_1\times F_2)) \\&\quad +score(M^{\prime }\cap ((V(G_1)-C)\times (V(G_2)-F_2)))\\&=score(M\cap (F_1\times F_2))+score(M_b^{\prime }) \end{aligned}$$We only need to prove \(score(M_b^{\prime })=score_b(M_b^{\prime })\).
Since \(V(G_1)-C\) is an independent set which only has edges with \(C\), and \(C\) is the excluded part of the matching \(M^{\prime }\), we can derive that \(score(M^{\prime } \cap ((V(G_1)-C)\times (V(G_2)-F_2)))\) only consists of the contributions of the edges \((u,v) \in E(G_1)\) such that \(u\in C\) and \(v\in V(G_1)-C\). From the construction of \(G_b\), the contribution for each \(v\in V(G_1)-C\) in \(M^{\prime }\) is just the weight of \((v,M^{\prime }[v])\) in the bipartite graph \(G_b\). This implies \(score(M_b^{\prime }) = score_b(M_b^{\prime })\).
For (c), it can be easily derived from (b) because \(M^+(C)\)\(\in \mathcal M \) and \(M_b=M^+(C)\cap ((V(G_1)-C)\times (V(G_2)-F_2))\).

Theorem 4 shows that the size of \(\mathcal M \) is exponentially large. Both \(M\) and \(M^+(C)\) are elements in \(\mathcal{M}\), and \(M^+(C)\) is the optimal matching among all matchings in \(\mathcal M \). This implies that \(M^+(C)\) is the best among a large number of matchings in \(\mathcal M \) and \(score(M^+(C)) \ge score(M)\). For two graphs with 2,000 nodes each, the number of nodes in a vertex cover can reasonably be assumed to be 1,000 (50 %). \(M^+(C)\) is then the best among at least a factorial of 1,000 (1,000\(!\)) possible matchings.
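The counting in Theorem 4(1) and its two bounds can be evaluated directly for small \(min\) and \(max\). A short Python sketch (function names are ours, for illustration only):

```python
from math import comb, factorial, perm

def num_matchings(mn, mx):
    """|M| from Theorem 4(1): number of partial injective matchings
    between an included set of size mn and one of size mx,
    i.e. sum over i of C(mn, i) * P(mx, i)."""
    return sum(comb(mn, i) * perm(mx, i) for i in range(mn + 1))

def bounds(mn, mx):
    """Lower bound max!/(max-min)! and upper bound (max+1)^min."""
    lower = factorial(mx) // factorial(mx - mn)
    upper = (mx + 1) ** mn
    return lower, upper
```

For instance, with \(min=2\) and \(max=3\) the formula gives \(1 + 6 + 6 = 13\) matchings, lying between the lower bound \(6\) and the upper bound \(16\).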

### 6.3 Random refinement excluding \(C - F_1\)

If \(M\) itself is an optimal matching in \(\mathcal M \), or the selected vertex cover \(C\) includes most of the nodes in \(G_1\) that are not well matched, it is possible that \(M^+(C)\) cannot improve \(M\). As an example, suppose in Example 6 that \(C=\{u_{11},u_{12}, u_{13}\}\); then the new bipartite graph \(G_b\) is the one shown in Fig. 5d. In other words, using the maximum weighted bipartite matching of \(G_b\), the matching \(M^+(C)\) might be the same as \(M\). The reason is that the mismatched nodes are excluded from refinement by the vertex cover \(C\). We give an approach based on two strategies to solve this problem: (1) make \(C\) smaller, so that more mismatched nodes can be included in the refinement; (2) iteratively refine the current matching using different vertex covers, so that every mismatched node has a chance to be included in the refinement. The first strategy is based on the following lemma.

**Lemma 1**

For any two vertex covers \(C_1\) and \(C_2\) of \(G_1\), if \(C_1\subseteq C_2\), then \(score(M^+(C_1))\ge score(M^+(C_2))\).

*Proof 5*

The approach to randomly select a minimal vertex cover of a graph \(G\) is shown in Algorithm 5. First, in line 1, we shuffle all nodes in the graph and put them into a list \(\mathcal L \), such that every permutation of \(V(G)\) has the same probability of being \(\mathcal L \). In lines 2–3, we find a vertex cover of \(G\) by adding the nodes in \(\mathcal L \) one by one. A node is added to the vertex cover if and only if it contributes at least one edge to the currently covered edges (line 3). This operation can be implemented as follows. For every node in the graph, we maintain its number of uncovered edges, initially set to the degree of the node. Before adding a new node into the cover, we first check its number of uncovered edges. If it is \(0\), we skip the node and continue with the next one in \(\mathcal{L}\). Otherwise, we add the node into the cover and traverse its adjacent nodes in the graph, decreasing each adjacent node's number of uncovered edges by \(1\). In this way, the total complexity of lines 2–3 is \(O(|E(G)|)\), since every edge in \(G\) is visited at most once. Lines 4–5 make the current vertex cover minimal by removing useless nodes, that is, nodes whose removal leaves every currently covered edge still covered. The following lemma shows that, for any minimal cover \(C\) of a graph \(G\), there is a considerable number of ways for Algorithm 5 to generate \(C\).
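The shuffle-then-prune procedure above can be sketched as follows. This is a minimal Python rendering of the described steps, assuming the graph is given as an adjacency dictionary:

```python
import random

def random_minimal_vertex_cover(adj):
    """Sketch of Algorithm 5: random minimal vertex cover.

    adj : dict mapping each node to the set of its neighbors (undirected)
    """
    nodes = list(adj)
    random.shuffle(nodes)                      # line 1: random permutation L
    uncovered = {u: len(adj[u]) for u in adj}  # uncovered edges per node
    cover = set()
    for u in nodes:                            # lines 2-3: greedy addition
        if uncovered[u] > 0:                   # contributes >= 1 new edge
            cover.add(u)
            uncovered[u] = 0
            for v in adj[u]:                   # edge (u, v) is now covered
                if v not in cover:
                    uncovered[v] -= 1
    for u in list(cover):                      # lines 4-5: make it minimal
        if all(v in cover for v in adj[u]):    # u covers no edge alone
            cover.remove(u)
    return cover
```

Whatever permutation the shuffle produces, the result is a vertex cover, and removing any single node from it uncovers some edge (minimality).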

**Lemma 2**

For any minimal vertex cover \(C\) of graph \(G\), there are at least \(|C|!\times |V(G)-C|!\) permutations of \(V(G)\), such that Algorithm 5 generates \(C\).

*Proof 6*

We construct the \(|C|!\times |V(G)-C|!\) permutations as follows. For each permutation, we put \(C\) in the front in any order followed by \(V(G)-C\) in any order. The number of such permutations is \(|C|!\times |V(G)-C|!\). Now we prove for any such permutation, Algorithm 5 can generate \(C\). Since \(C\) is minimal, in the first \(|C|\) loops of lines 2–3 of Algorithm 5, the conditions in line 3 are all satisfied, and in the last \(|V(G)-C|\) loops of lines 2–3, the conditions in line 3 are all unsatisfied because \(C\) is already a vertex cover of \(G\). So after the loop in lines 2-3, \(C\) is generated. Since \(C\) is already minimal, the loop in lines 4–5 will eliminate no node. Thus, Algorithm 5 can generate \(C\). \(\square \)

The main refine approach is an iterative algorithm shown in Algorithm 6. We iteratively update the current matching until the matching is not improved in a certain iteration. In each iteration (lines 2–6), we try \(X\) times to find a new random minimal vertex cover \(C\) (line 4), generate the matching \(M^+(C)\) using the method introduced above (line 5), and update the current matching if \(M^+(C)\) is a better matching (line 6). Here, \(X\) is a constant (\(\ge 1\)) used to avoid terminating the whole process just because one bad cover was selected. In our experiments, when \(X = 5\) and \(X = 10\), over 92 and 99 % of the nodes, respectively, have a chance to be included in the refinement. We use \(X = 5\). Note that in line 3, we choose \(C\) to be a vertex cover of either \(G_1\) or \(G_2\) with the same probability to increase the randomness.
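The control flow of this iterative refinement can be sketched as a small driver loop. This is only a skeleton with our own illustrative names: `score`, `covers`, and `make_refined` stand in for the matched-edge count, the random minimal vertex cover routine (Algorithm 5), and the construction of \(M^+(C)\), respectively.

```python
def refine(M, score, covers, make_refined, X=5):
    """Skeleton of Algorithm 6 (names are illustrative, not the paper's code).

    M            : initial matching
    score        : callable giving the number of matched edges of a matching
    covers       : callable returning a fresh random minimal vertex cover
    make_refined : callable (matching, cover) -> candidate matching M+(C)
    X            : number of tries per iteration before giving up
    """
    improved = True
    while improved:                 # stop when an iteration brings no gain
        improved = False
        for _ in range(X):          # try X random covers per iteration
            C = covers()
            cand = make_refined(M, C)
            if score(cand) > score(M):
                M, improved = cand, True
    return M
```

Because the current matching is replaced only when the candidate strictly improves the score, the loop terminates once no cover among the \(X\) tries yields a better matching.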

**Theorem 5**

The time complexity of Algorithm 6 is \(O(m\cdot n^3)\), for \(m = \min \{|E(G_1)|,|E(G_2)|\}\) and \(n = \max \{|V(G_1)|, |V(G_2)|\}\).

*Proof 7*

Algorithm 6 is the main refinement. The while loop in line 1 repeats at most \(m\) times, because the optimal solution can match at most \(m\) edges and in each iteration the number of matched edges of the current solution increases by at least \(1\). In each iteration, the dominant part is finding the maximum weight bipartite matching using the Hungarian algorithm, which can be done in \(O(n^3)\) time. Since \(X\) is a constant, the total time complexity of Algorithm 6 is \(O(m\cdot n^3)\). \(\square \)
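To make the maximum weight bipartite matching step concrete, the sketch below solves it by brute force over permutations. This is for illustration only and is not the paper's implementation; the Hungarian algorithm referred to in the proof computes the same optimum in \(O(n^3)\) instead of \(O(n!)\).

```python
from itertools import permutations

def max_weight_bipartite(weights):
    """Brute-force maximum weight bipartite matching (illustration only).

    weights : n x n matrix; weights[i][j] is the weight of matching
              left node i to right node j.
    Returns the best total weight and the assignment as a permutation.
    """
    n = len(weights)
    best, best_perm = float('-inf'), None
    for p in permutations(range(n)):          # every complete assignment
        w = sum(weights[i][p[i]] for i in range(n))
        if w > best:
            best, best_perm = w, p
    return best, best_perm
```

On the \(2\times 2\) matrix \([[3,1],[1,2]]\), the identity assignment with total weight \(5\) is optimal.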

### 6.4 Random refinement including \(C - F_1\)

- (1)
\( M^{\prime }\cap (F_1\times F_2) = M\cap (F_1 \times F_2) \)

- (2)
\(F_1 \cup (V(G_1)-P_1-M^{\prime -1}[V(G_2)-P_2])\) is a vertex cover of \(G_1\).

**Theorem 6**

\(\mathcal M \subseteq \mathcal M ^*\) and suppose \(M_\mathcal M ^*\) is the optimal solution among all matchings in \(\mathcal M ^*\), we have \(score(M^*(F_1))\ge score(M_\mathcal M ^*)\ge score(M^+(C))\).

*Proof 8*

For any \(M^{\prime }\in \mathcal M \), we have \(M^{\prime }\cap (F_1\times F_2) = M\cap (F_1 \times F_2)\) and \(M^{\prime }[C-F_1] = \emptyset \). The first condition is the same as the first condition of \(\mathcal M ^*\). Since \(M^{\prime }[C-F_1] = \emptyset \), we have \((C-F_1)\cap M^{\prime -1}[V(G_2)-P_2]=\emptyset \). We also have \((C-F_1) \cup M^{\prime -1}[V(G_2)-P_2] \subseteq V(G_1)-P_1\); accordingly, \(C-F_1\subseteq V(G_1)-P_1-M^{\prime -1}[V(G_2)-P_2]\), and thus \(C\subseteq F_1 \cup (V(G_1)-P_1-M^{\prime -1}[V(G_2)-P_2])\). Since \(C\) is a vertex cover of \(G_1\), \(F_1 \cup (V(G_1)-P_1-M^{\prime -1}[V(G_2)-P_2])\) is a vertex cover of \(G_1\), hence we have \(M^{\prime }\in \mathcal M ^*\). Thus \(\mathcal M \subseteq \mathcal M ^*\) holds.

Theorem 6 implies that the new space \(\mathcal{M}^*\) is larger than the space \(\mathcal M \) used in the refinement excluding \(C - F_1\), and that the new matching \(M^*(F_1)\) is no worse than the optimal matching in \(\mathcal M ^*\). This implies that \(score(M^*(F_1))\ge score(M^+(C))\), where \(M^+(C)\) is the optimal matching in \(\mathcal M \). It is worth noting that the cover \(C\) of \(G_1\) does not participate in the construction of \(M^*(F_1)\) directly. The matching \(M^*(F_1)\) can be computed as long as \(F_1\) is generated, and \(F_1\) can be computed easily by the following lemma.

**Lemma 3**

Suppose \(G_1[P_1]\) is the subgraph of \(G_1\) induced by \(P_1\). If \(C\) is a vertex cover of \(G_1\), then \(C\cap P_1\) is a vertex cover of \(G_1[P_1]\), and if \(C_{P_1}\) is a vertex cover of \(G_1[P_1]\), then there exists a vertex cover \(C\) of \(G_1\) such that \(C_{P_1}\subseteq C\).

*Proof 9*

We first prove that if \(C\) is a vertex cover of \(G_1\), then \(C\cap P_1\) is a vertex cover of \(G_1[P_1]\). Suppose \(C\cap P_1\) is not a vertex cover of \(G_1[P_1]\); then there exists an edge \((u,v) \in E(G_1[P_1])\) such that \(u\notin C\cap P_1\) and \(v \notin C\cap P_1\). Since \(C\) is a vertex cover of \(G_1\), we have \(u\in C\) or \(v\in C\). Without loss of generality, we suppose \(u\in C\). Since \(u\notin C\cap P_1\), we have \(u\in C-(C\cap P_1)\), which contradicts \(u\in V(G_1[P_1])\). Thus, \(C\cap P_1\) is a vertex cover of \(G_1[P_1]\).

We then prove that if \(C_{P_1}\) is a vertex cover of \(G_1[P_1]\), then there exists a vertex cover \(C\) of \(G_1\) such that \(C_{P_1}\subseteq C\). We only need to prove that \(C=C_{P_1} \cup (V(G_1)-P_1)\) is a vertex cover of \(G_1\). For any \((u,v)\in E(G_1)\), if \(u\in P_1\) and \(v \in P_1\), \((u,v)\) is covered by \(C\) because \(C_{P_1}\) is a vertex cover of \(G_1[P_1]\). Otherwise, without loss of generality, we suppose \(u\notin P_1\); then \(u\in V(G_1)-P_1\subseteq C\), so \((u,v)\) is also covered by \(C\). As a result, all edges in \(E(G_1)\) are covered by \(C\), thus \(C\) is a vertex cover of \(G_1\). \(\square \)
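Both directions of Lemma 3 can be checked mechanically on small graphs. A minimal Python sketch (our own helper names; the second function implements the construction \(C=C_{P_1} \cup (V(G_1)-P_1)\) from the proof):

```python
def cover_of_induced(C, P, edges):
    """Lemma 3, part 1: C ∩ P covers the subgraph induced by P."""
    induced = [(u, v) for u, v in edges if u in P and v in P]
    CP = C & P
    return all(u in CP or v in CP for u, v in induced)

def extend_cover(CP, V, P, edges):
    """Lemma 3, part 2: C = C_P ∪ (V - P) is a vertex cover of the
    whole graph whenever C_P covers the subgraph induced by P."""
    C = CP | (V - P)
    assert all(u in C or v in C for u, v in edges)  # sanity check
    return C
```

For the path \(1-2-3-4-5\) with \(P=\{1,2,3\}\): the cover \(\{2,4\}\) of the whole graph restricts to the cover \(\{2\}\) of the induced path \(1-2-3\), and conversely \(\{2\}\) extends to the cover \(\{2,4,5\}\).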

Based on Lemma 3, we can derive that \(F_1\), a vertex cover of \(G_1[P_1]\), is enough to generate \(M^*(F_1)\). Our new refinement algorithm is shown in Algorithm 7, which is the \(\mathsf refine \) used in Algorithm 1. We use \(X = 5\). Compared to Algorithm 6, there are two major modifications. The first concerns the cover computation in lines 3–4: instead of computing a cover of \(G_1\) (or \(G_2\) if we select \(G_2\) as the first graph in line 3), we only compute a vertex cover of \(G_1[P_1]\) (or \(G_2[P_2]\)). For the second modification, instead of computing \(M^+(C)\), we compute the new matching \(M^*(F)\).

## 7 Labeled graph handling

In previous sections, we concentrate on unlabeled graphs. In this section, we discuss how to handle node-labeled graphs. Given a set of node-labels \(\Sigma _V\), a node-labeled graph is denoted by \( G =(V, E, l)\). Here, \(l\) is a labeling function: \(V \rightarrow \Sigma _V\). We use \(l(u)\) to denote the label of \(u\) for every node \(u \in V(G)\). The definitions of graph isomorphism and graph matching are given as follows.

**Definition 4**

Graph/Subgraph Isomorphism (labeled graph). Graph \(G_1(V, E, l_1)\) is isomorphic to graph \(G_2(V, E, l_2)\), if and only if there exists a bijective function \(f: V(G_1) \rightarrow V(G_2)\) such that for any two nodes \(u_1 \in V(G_1)\) and \(u_2 \in V(G_1)\), \((u_1,u_2) \in E(G_1)\) if and only if \(l_1(u_1) = l_2(f(u_1))\), \(l_1(u_2) = l_2(f(u_2))\), and \((f(u_1), f(u_2)) \in E(G_2)\). \(G_1\) is subgraph isomorphic to \(G_2\), if and only if there exists a subgraph \(G^{\prime }\) of \(G_2\) such that \(G_1\) is isomorphic to \(G^{\prime }\).

**Definition 5**

Graph Matching (labeled graph). Given two graphs \(G_1(V, E, l_1)\) and \(G_2(V, E, l_2)\), a matching \(M\) between \(G_1\) and \(G_2\) is a set of pairs \(M = \{(u, v)| u \in V(G_1), v \in V(G_2), l_1(u) = l_2(v) \}\), such that for any two pairs \((u_1, v_1)\)\(\in M\) and \((u_2, v_2)\in M\), \(u_1 \ne u_2\) and \(v_1 \ne v_2\). The optimal matching \(M\) of two graphs is the one with the largest number of matched edges. Finding the optimal matching \(M\) is the same as *MCS*.
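The objective in Definition 5, the number of matched edges under a label-respecting matching, can be computed directly. A small Python sketch (function and variable names are ours, for illustration):

```python
def matched_edges(M, edges1, edges2, l1, l2):
    """Count matched edges under a label-respecting matching (Definition 5).

    M : dict u -> v; requires l1[u] == l2[v] for every pair
    edges1, edges2 : edge lists of G1 and G2 (undirected)
    l1, l2 : node-label dicts for G1 and G2
    """
    assert all(l1[u] == l2[v] for u, v in M.items())  # label constraint
    e2 = {frozenset(e) for e in edges2}               # undirected lookup
    return sum(1 for u1, u2 in edges1
               if u1 in M and u2 in M
               and frozenset((M[u1], M[u2])) in e2)
```

For two labeled paths \(A-B-A\), both the identity-like matching and the reversed matching preserve labels and match both edges, which is why label ties can leave several optimal matchings.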

For handling node-labeled graphs, the matching construction and matching refinement need to be modified.

- (1)
\(0\le S_l[u,v] \le 1\).

- (2)
\(S_l[u,v] \ge \dfrac{(|V(\mathsf{mcs } (G_u^k,G_v^k))|+|E(\mathsf{mcs } (G_u^k,G_v^k))|)^2}{(|V(G_u^k)|+|E(G_u^k)|)(|V(G_v^k)|+|E(G_v^k)|)}\) if the node-labels of \(u\) and \(v\) are the same.

- (3)
If \(G_u^k\) and \(G_v^k\) are isomorphic, and \(u\) is matched to \(v\) in the optimal matching of \(G_u^k\) and \(G_v^k\), then \(S_l[u,v]=1\).

- (4)
If \(G_u^k\) is subgraph isomorphic to \(G_v^k\), and \(u\) matches \(v\) in the optimal matching of \(G_u^k\) and \(G_v^k\), we have \(S_l[u,v]=\dfrac{|V(G_u^k)|+|E(G_u^k)|}{|V(G_v^k)|+|E(G_v^k)|}\).

## 8 Performance studies

We compare our algorithms, \(\mathsf cons \) (matching construction only) and \(\mathsf consR \) (matching construction plus matching refinement), with five state-of-the-art graph matching algorithms: \(\mathsf ume \), \(\mathsf heat \), \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \). Here, \(\mathsf ume \) is the improved Umeyama algorithm in [17, 32], and \(\mathsf heat \), \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \) are the algorithms proposed in [28, 34, 36], and [37], respectively. We implemented \(\mathsf ume \), \(\mathsf heat \), and our algorithms using Visual C++ 2005 and Matlab R2009a: the C++ part calls Matlab to compute the eigenvalues and eigenvectors of the matrix and executes the rest of the algorithm in C++. For \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \), we downloaded the source code of the graph matching package GraphM.^{2} All tests were conducted on a PC with a 2.66 GHz CPU and 3.43 GB memory running Windows XP.

We evaluated the algorithms using both real and synthetic datasets. The real datasets include the Power Network and the NCI dataset. The synthetic datasets are generated using two models, namely the Scale-Free/Power Law Model and the Erdos Renyi Model. We use the software Pajek^{3} to generate graphs under these two models. We mainly focus on testing unlabeled graphs and discuss the results for labeled graphs in Sect. 8.5.

**Power Network (PN)** is the electrical power grid of the western US, selected from the University of Florida Sparse Matrix Collection.^{4} The nodes are generators, transformers, and substations, and the edges are the high-voltage transmission lines between them. The graphs are proved to be power law networks [5, 33]. The dataset contains graphs with the number of nodes varying from 39 to 5,300. The information on the graphs used in our testing is shown in Table 1.

The PN dataset

| Graph | g1 | g2 | g3 | g4 | g5 |
|---|---|---|---|---|---|
| \(\vert V\vert \) | 118 | 443 | 1,454 | 1,723 | 5,300 |
| \(\vert E\vert \) | 179 | 590 | 1,923 | 2,394 | 6,094 |

**NCI dataset (NCI)** contains the compound structures from the National Cancer Institute Open Database.^{5} The NCI dataset contains 233,281 connected graphs. The average node number is 21.17 and the average node degree is 2.2. According to the number of nodes in a graph, we selected 5 groups, containing graphs with node numbers \(10\pm 4\), \(15\pm 4\), \(20\pm 4\), \(25\pm 4\), and \(30\pm 4\), respectively.

**Scale-Free/Power Law Model (SF)** is a network model whose node degree distribution follows a power law, at least asymptotically. We generate graphs with node numbers 100, 500, 1,000, 2,500, and 5,000, respectively, with default value 1,000. The average node degree is 4.

**Erdos Renyi Model (ER)** is a classical random graph model. It defines a random graph as \(N\) nodes connected by \(M\) edges that are chosen randomly from the \(N(N -1)/2\) possible edges. We generate graphs with node numbers 100, 500, 1,000, 2,500, and 5,000, respectively, with default value 1,000. The average node degree is 4.
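The \(G(N, M)\) sampling just described is straightforward to reproduce. A minimal Python sketch (the paper uses Pajek for generation; this stand-in is only illustrative):

```python
import random

def erdos_renyi_gnm(n, m, seed=None):
    """Sample an Erdos-Renyi G(N, M) graph: m edges chosen uniformly
    at random from the n*(n-1)/2 possible edges."""
    rng = random.Random(seed)
    possible = [(u, v) for u in range(n) for v in range(u + 1, n)]
    return rng.sample(possible, m)  # m distinct edges, no repeats
```

For example, `erdos_renyi_gnm(1000, 2000)` yields an ER graph with average node degree 4, matching the default setting used in the experiments.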

For NCI, since each graph is small, we can compute the optimal matching for each pair of graphs using a backtracking method. For graphs in the other datasets, it is infeasible to compute the optimal matching. Therefore, for any graph \(G_1\), we generate \(G_2\) by randomly inserting/deleting a certain percentage of nodes/edges and then shuffling all nodes.
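The test-pair generation can be sketched as follows. This simplified Python version perturbs edges only (the paper also inserts/deletes nodes) and then shuffles node identities; all names are ours, for illustration.

```python
import random

def perturb(nodes, edges, p, seed=None):
    """Generate G2 from G1: delete a fraction p of the edges, insert the
    same number of random new edges, then shuffle node identities
    (an edge-only sketch of the paper's test-pair generation)."""
    rng = random.Random(seed)
    edges = [tuple(sorted(e)) for e in edges]
    k = int(p * len(edges))
    present = set(rng.sample(edges, len(edges) - k))   # delete k edges
    while len(present) < len(edges):                   # insert k new edges
        u, v = rng.sample(nodes, 2)
        present.add(tuple(sorted((u, v))))
    shuffled = rng.sample(nodes, len(nodes))           # random bijection
    relabel = dict(zip(nodes, shuffled))
    return [tuple(sorted((relabel[u], relabel[v]))) for u, v in present]
```

The shuffling step is what makes the task nontrivial: even when \(p\) is small, a matching algorithm must rediscover the node correspondence rather than rely on node identifiers.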

Parameters

| Parameter | Range | Default |
|---|---|---|
| #nodes | 100, 500, 1,000, 2,500, 5,000 | 1,000 |
| Average degree | 2, 3, 4, 5, 6 | 4 |
| \(p\) | 0.1, 0.15, 0.2, 0.25, 0.3 | 0.2 |
| \(\tau \) | 0.82, 0.86, 0.90, 0.94, 0.98 | – |
| \(k\) | 0, 1, 2, 3, 4 | 2 |
| #labels | 0, 2, 4, 8, 16, 32 | – |

### 8.1 Comparison with the approximate algorithms

**Vary Graph Size:** We vary the number of nodes in the graphs from 100 to 5,000 and test the matching ratio of each algorithm on PN, SF, and ER. The results are shown in Fig. 6a–c, respectively. In all cases, the matching ratios of the \(\mathsf ume \), \(\mathsf heat \), and \(\mathsf iso \) algorithms are no larger than 0.2. This is because all three algorithms obtain the matching by maximizing the total weight of the similarity matrix and take little account of the neighborhood information of the two nodes. When the number of nodes is no larger than 120, the performance of \(\mathsf path \) is similar to \(\mathsf consR \), reaching a matching ratio above 0.9. As the sizes of the graphs increase, the matching ratio of \(\mathsf path \) decreases. \(\mathsf GA \) performs better than \(\mathsf path \) for large graphs, but still much worse than \(\mathsf consR \) in all cases. When the sizes of the graphs are above 2,500, \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \) cannot generate a result for some test cases under our computing environment; \(\mathsf path \) cannot even generate a result for graphs with more than 2,000 nodes.

**Vary** \(p\): We vary \(p\) from 0.1 to 0.3; the results for the PN, SF, and ER datasets are shown in Fig. 7a–c, respectively. In all cases, the matching ratios for the \(\mathsf ume \), \(\mathsf heat \), and \(\mathsf iso \) algorithms remain no larger than 0.2, for the same reason as analyzed for Fig. 6. The matching ratios for \(\mathsf path \) and \(\mathsf GA \) remain nearly constant as \(p\) changes. \(\mathsf consR \) performs best in all cases.

**Vary Initial Matching**: We compare \(\mathsf ume \), \(\mathsf heat \), \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \) with our matching construction algorithm \(\mathsf cons \). Figure 8a shows the results for PN. In most cases, \(\mathsf cons \) performs best among all algorithms. The only exception is when the number of nodes in the graphs is smaller than 120, where \(\mathsf path \) performs better than \(\mathsf cons \). Figure 8d shows the results of applying our matching refinement algorithm to these five algorithms; the refined versions are denoted as \(\mathsf umeR \), \(\mathsf heatR \), \(\mathsf isoR \), \(\mathsf pathR \), and \(\mathsf GAR \). For \(\mathsf ume \), \(\mathsf heat \), and \(\mathsf iso \), the matching ratios increase by 0.5 after refinement in all cases. Our \(\mathsf consR \) algorithm performs best in all cases after refinement, even when the number of nodes in the graphs is smaller than 120. Figure 8b, e show the testing results for SF. Our algorithms perform best both before and after refinement in all cases. The results for ER are shown in Fig. 8c, f. The performance for all test cases in ER is similar to that in SF.

**Vary Initial Matching on Small Graphs**: Since \(\mathsf path \) outperforms \(\mathsf cons \) when the graph size is around 100 but is beaten by \(\mathsf cons \) when the graph size is 500 in Fig. 8, we evaluate two sets of graphs by varying the node numbers in two separate ranges, (0, 100) and (100, 500). The results are shown in Fig. 9. Figure 9a, b show the results for SF and ER with graph sizes in (0, 100); \(\mathsf path \) performs better than \(\mathsf cons \) in most cases. Figure 9c, d show the matching ratio for SF and ER with graph sizes in (100, 500); \(\mathsf path \) performs better than or similarly to \(\mathsf cons \) only when the graph size is less than 200, and is outperformed by \(\mathsf cons \) when the graph size is larger than 200. Figure 9e–h show the matching ratio after refinement. \(\mathsf consR \) performs best in all cases. Even though \(\mathsf path \) slightly outperforms \(\mathsf cons \) for graphs with fewer than 200 nodes, our refinement algorithm \(\mathsf consR \) outperforms \(\mathsf pathR \). In addition, \(\mathsf path \) is quite time-consuming, as shown in Table 3. Considering both effectiveness and efficiency, we suggest using \(\mathsf consR \) rather than \(\mathsf pathR \) even for small graphs with fewer than 200 nodes.

**Comparison on Random Pairs**: We compare \(\mathsf ume \), \(\mathsf heat \), \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \) with our algorithm \(\mathsf consR \) on datasets where every two graphs are randomly selected or generated. For two randomly selected graphs, since the optimal number of matched edges is unknown, the matching ratio \(MR\) for each algorithm is the ratio of its number of matched edges to the largest number of matched edges among the six algorithms. For the real dataset NCI, we randomly select 5 groups containing graphs with \(10 \pm 4\), \(50 \pm 4\), \(100 \pm 4\), \(150 \pm 4\), and \(200\pm 4\) nodes, respectively. Figure 10a shows the matching ratio for the NCI dataset. \(\mathsf path \) performs better than \(\mathsf GA \), consistent with Fig. 9. \(\mathsf consR \) performs best among all algorithms in all cases. We also generate 5 groups of two random graphs according to the SF (ER) model, with the number of nodes ranging from 100 to 5,000. Figure 10b shows the matching ratio for the SF dataset. \(\mathsf consR \) performs best among all algorithms in all cases. \(\mathsf path \) and \(\mathsf GA \) have similar performance when the number of nodes is 100. However, \(\mathsf path \) performs badly when the graph size becomes larger. The underlying reason is that \(\mathsf path \) can easily be trapped in a poor local maximum when solving the linear combination of the convex and concave relaxations, since the structures of two randomly generated SF graphs might be quite different and unbalanced. Figure 10c shows the matching ratio for the ER model. \(\mathsf path \) and \(\mathsf GA \) have relatively stable performance, and \(\mathsf consR \) performs best among all algorithms.
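The relative \(MR\) used above can be computed as below; the edge counts in the usage example are hypothetical, chosen only to illustrate the normalization.

```python
def relative_matching_ratio(matched_edges):
    """Relative matching ratio MR: each algorithm's number of matched
    edges divided by the best (largest) number of matched edges
    achieved by any of the compared algorithms."""
    best = max(matched_edges.values())
    return {name: m / best for name, m in matched_edges.items()}

# hypothetical matched-edge counts for the six compared algorithms
mr = relative_matching_ratio(
    {"ume": 120, "heat": 150, "iso": 140, "path": 600, "GA": 550, "consR": 800})
# the best performer always gets MR = 1.0
```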

Table 3 Processing time comparison (PN)

Graph | g1 | g2 | g3 | g4 | g5
---|---|---|---|---|---
\(\mathsf selection \) (s) | 0.05 | 0.45 | 20.41 | 33.11 | 1,221.58
\(\mathsf expansion \) (s) | 0.02 | 0.08 | 1.11 | 0.97 | 32.92
\(\mathsf refinement \) (s) | 0.05 | 0.72 | 6.81 | 10.45 | 205.05
\(\mathsf consR \) (s) | 0.11 | 1.25 | 28.33 | 44.53 | 1,459.54
\(\mathsf ume \) (s) | 0.05 | 0.48 | 20.63 | 33.41 | 1,561.58
\(\mathsf heat \) (s) | 0.11 | 2.06 | 72.79 | 116.47 | 4,117.50
\(\mathsf iso \) (s) | 0.12 | 2 | 185 | 57 | –
\(\mathsf path \) (s) | 14 | 411 | 11,700 | 22,200 | –
\(\mathsf GA \) (s) | 1 | 8 | 150 | 253 | –

**Efficiency Testing:** We test the efficiency of the algorithms with PN. The processing time is shown in Table 3. We divide the whole processing time of our algorithm into three phases, namely \(\mathsf selection \), \(\mathsf expansion \), and \(\mathsf refinement \), denoting the processing time for anchor selection, anchor expansion, and matching refinement, respectively. The processing time for \(\mathsf consR \) is the sum of the time for all three phases. Among the three phases, the most costly part in all cases is anchor selection, because it involves calculating the eigenvalues of matrices. In all test cases, our \(\mathsf consR \) algorithm is faster than \(\mathsf iso \), \(\mathsf heat \), \(\mathsf path \), and \(\mathsf GA \), and is similar to \(\mathsf ume \). For the largest graph g5, the total processing time for \(\mathsf consR \) is under 25 min, while \(\mathsf heat \) needs more than 1 h, and \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \) cannot generate a result at all under our current computing environment.

### 8.2 Comparison with the exact algorithm

We randomly select 1,000 pairs from the group of graphs with \(20\pm 4\) nodes. For each pair, we compute the ratio of the optimal matched edges to the size of the smaller graph, and vary this ratio from 0.75 to 0.95. For each ratio, we compute the average processing time and average accuracy of our algorithm. The results are shown in Fig. 11c, d. When the ratio increases from 0.75 to 0.95, the processing time of the backtracking algorithm decreases from 10,000 s to 1 s. This is because when the ratio is large, the upper bound used to prune branches in the backtracking is tight [26], so the algorithm stops early. Our \(\mathsf consR \) algorithm consumes no more than 0.1 s in all cases. The average accuracy (matching ratio) is 0.8 for our \(\mathsf cons \) algorithm and 0.95 for our \(\mathsf consR \) algorithm.

### 8.3 Parameter and scalability testing

**Vary Global Similarity Measures:** We compare the matching results of our algorithms \(\mathsf cons \) and \(\mathsf consR \) using three different global similarity measures: spectral similarity, the Katz score, and the RWR score, denoted as \(\mathsf Spec \), \(\mathsf Katz \), and \(\mathsf RWR \), respectively. We set the attenuation factor \(b\) used in Katz to \(1/(d+1)\) by default [14], where \(d\) is the maximum degree of the graph. For the parameter \(c\) in \(\mathsf RWR \), we use the same setting as [30]. Figure 12 shows the performance of these three global similarity measures. The matching ratio of construction is shown in Fig. 12a–c. \(\mathsf Katz \) is outperformed by \(\mathsf Spec \) and \(\mathsf RWR \) in most cases. \(\mathsf Spec \) performs best in most cases, and \(\mathsf RWR \) performs better in a few cases. However, the differences among the three are marginal. The underlying reason is that both the Katz score and the RWR score originate from the idea of random walks, whose stationary distribution converges to the largest eigenvector. As shown in Fig. 12d–f, our refinement algorithm can refine the matching obtained with any of the three measures to a better result.
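As a concrete illustration of the Katz score and the role of \(b\), here is a minimal truncated-series sketch (our own, not the paper's implementation), assuming a dense 0/1 adjacency matrix. Setting \(b = 1/(d+1)\) guarantees convergence, since the spectral radius of \(A\) is at most the maximum degree \(d\).

```python
def katz_scores(adj, b, k_max=50):
    """Katz score matrix S = sum_{k>=1} b^k * A^k, truncated at k_max.
    adj is a dense 0/1 adjacency matrix given as a list of lists."""
    n = len(adj)

    def matmul(X, Y):
        return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
                for i in range(n)]

    power = [row[:] for row in adj]                      # A^1
    S = [[b * power[i][j] for j in range(n)] for i in range(n)]
    bk = b
    for _ in range(2, k_max + 1):
        power = matmul(power, adj)                       # A^k
        bk *= b                                          # b^k
        for i in range(n):
            for j in range(n):
                S[i][j] += bk * power[i][j]
    return S

# Two-node path graph: d = 1, so b = 1/(d+1) = 0.5.
S = katz_scores([[0, 1], [1, 0]], b=0.5)
```

For this tiny graph the series sums in closed form: \(S_{01} = b/(1-b^2)\) and \(S_{00} = b^2/(1-b^2)\).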

**Vary** \({\varvec{\tau }}\): We first test the sensitivity of \(\tau \), the similarity threshold used in anchor selection (Algorithm 2), showing representative values from 0.82 to 0.98. We list \(MR\) for the two steps, construction and refinement, on the 5 graphs. As shown in Fig. 13a, for all 5 graphs in the PN dataset, as \(\tau \) increases, the matching ratio rises to a peak and then decreases. This is because when \(\tau \) is small, the number of anchors is large and some mismatched anchors are involved, whereas when \(\tau \) is large, the number of anchors is too small, so few nodes can be referenced during expansion. The peak value of \(MR\) varies from 0.7 to 0.9, and the \(\tau \) value that produces the peak differs across graphs. In Fig. 13d, as \(\tau \) increases, the \(MR\) value also rises and then decreases. Compared to Fig. 13a, after refinement the matching ratio increases by 10 % on average, and the average \(MR\) is above \(0.95\) when \(\tau \) is fixed at its default value \(0.9\). The results for the SF dataset when varying \(\tau \) from 0.82 to 0.98 in the 5 cases are shown in Fig. 13b, e. For the matching construction step, the matching ratio rises and then decreases, with a peak value of 0.8 on average. After refinement, most matching ratios increase to 1, except for several cases where \(\tau \) is too large. This is because when \(\tau \) is large, few anchors are selected, so many nodes are mismatched in the expansion step, and such a large number of mismatched nodes can hardly be refined perfectly. Figure 13c, f show the situation for the ER dataset. The performance in the construction step is similar to that on the SF dataset. In the refinement step, when \(\tau \) is small, the matching can hardly be refined perfectly. This is because when \(\tau \) is small, many anchors are selected, and in the ER dataset node degrees are uniformly distributed, so the number of mismatched anchors is large, which hurts the performance of the refinement step.

**Vary** \({\varvec{k}}\): We vary \(k\), the \(k\)-neighborhood used for local similarity computation. Figure 14a shows the results for the construction step on the PN dataset when varying \(k\) from 0 to 4. The performance is best when \(k\) is 1 or 2. This is because when \(k\) is small, very little local information is involved in the local similarity, and when \(k\) is large, too much noise is added to it. Figure 14d shows that, after refinement, the curves for all graphs are similar to those in the construction step, but the \(MR\) values increase by 10 % on average. When \(k\) is set to its default value \(2\), the average matching ratio reaches \(0.95\) in most cases. Figure 14b, e show the results of the two steps on the SF dataset when varying \(k\). In all cases, \(k=2\) always leads to the best matching in the construction step. In the refinement step, the matching ratios for \(k=1\) increase greatly in all cases, even though their performance in the first step is not as good. This is because when \(k=1\), the anchor selection can still select good pairs of anchors, and the errors induced by the anchor expansion are easier to repair in the refinement step. The results on the ER dataset when varying \(k\) are shown in Fig. 14c, f. For the construction step, the performance is similar to that on the SF dataset. For the refinement step, when the number of nodes is large, the errors induced in the construction step with \(k=1\) can hardly be repaired. This is because in the ER dataset node degrees are uniformly distributed; with \(k=1\), even the anchor selection cannot select good anchor pairs, so it is hard for the refinement step to generate a good matching.

**Vary** \({\varvec{p}}\) **and degree**: We compare our algorithms \(\mathsf cons \) and \(\mathsf consR \) by varying \(p\) from 0.1 to 0.3. The results on PN are shown in Fig. 15a. As \(p\) increases, the matching ratio of both \(\mathsf cons \) and \(\mathsf consR \) decreases, because a larger \(p\) means the two graphs to be matched are less similar. \(\mathsf consR \) is 10 % better than \(\mathsf cons \) on average, and its average matching ratio in all cases is above 0.95. The results for SF and ER when varying \(p\) are similar to those for PN. Figure 15d shows the testing results on SF when varying the average degree.

It shows that as the average degree increases, the matching ratio of \(\mathsf cons \) decreases. This is because when the average degree is large, the anchor expansion algorithm keeps many pairs in the queue in its early stages, which increases the probability that nodes are mismatched. \(\mathsf consR \) is consistent and raises the matching ratio to 1 in most cases. The performance on ER when varying the average degree is similar to that on SF. The results when varying \(p\) in the SF and ER datasets are shown in Fig. 15b, c, respectively. As \(p\) increases, the matching ratio of \(\mathsf cons \) decreases in both datasets. The matching ratios on SF decrease more slowly because node degrees in SF follow a power law distribution: although the similarity of the two graphs decreases, the anchors still have a good chance of being matched correctly. Our \(\mathsf consR \) algorithm effectively corrects the errors made in the initial matching, raising the average matching ratio above 0.98 in both datasets.

### 8.4 Sensitivity of randomness (PN)

Sensitivity of randomness

Graph | g1 | g2 | g3 | g4 | g5
---|---|---|---|---|---
Average MR | 0.860 | 0.950 | 0.961 | 0.949 | 0.975
SD | 0.008 | 0.008 | 0.002 | 0.003 | 0.002

### 8.5 Effectiveness of label distribution

In this experiment, we compare our algorithms, \(\mathsf cons \) and \(\mathsf consR \), with five other algorithms, \(\mathsf ume \), \(\mathsf heat \), \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \), on labeled graphs. Since \(\mathsf ume \) and \(\mathsf heat \) were originally designed for unlabeled graphs, we modify them to handle labeled graphs by setting the similarity score of two nodes with different labels to 0 before applying the Hungarian algorithm on the similarity matrix to obtain the maximum-weight bipartite matching. For \(\mathsf iso \), \(\mathsf path \), and \(\mathsf GA \), we use the setting for the case of constrained graph matching [37], in which only nodes with the same label can be matched with each other. In other words, we set the node similarity to 1 for two nodes \(u \in G_1\) and \(v \in G_2\) with the same label and 0 otherwise, and set the objective to maximizing only the number of matched edges under the label constraint.

We evaluate these algorithms on both the real dataset PN (Power Law model) and the synthetic dataset ER (Erdos Renyi model). In the PN dataset, we use graph g3 by default, and in the ER dataset, we use the graph with 1,000 nodes by default. We evaluate two different node-label distributions: uniform and power law. For the former, we assign labels to nodes uniformly; for the latter, we assign labels to nodes according to a power law probability distribution \(p(x) = C \cdot x^{-\alpha }\). Since \( 2< \alpha < 3 \) for most real networks [23], we set \(\alpha = 2.5\) in our experiment. All other parameters are set to the default values listed in Table 2.
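The power law label assignment above can be sketched as follows; this is our own minimal illustration, with the function name and parameter choices (8 labels, 10,000 nodes) picked only for the example.

```python
import random

def powerlaw_labels(n_nodes, n_labels, alpha=2.5, seed=None):
    """Assign labels 1..n_labels to nodes following a power law
    distribution p(x) = C * x^(-alpha), where C normalizes the
    probabilities so they sum to 1."""
    rng = random.Random(seed)
    weights = [x ** (-alpha) for x in range(1, n_labels + 1)]
    c = 1.0 / sum(weights)                      # normalization constant C
    probs = [c * w for w in weights]
    return rng.choices(range(1, n_labels + 1), weights=probs, k=n_nodes)

labels = powerlaw_labels(10000, 8, alpha=2.5, seed=1)
```

With \(\alpha = 2.5\), label 1 dominates: roughly three quarters of the nodes receive it, and label frequencies fall off sharply thereafter.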

## 9 Conclusions

In this paper, we study how to match two large graphs, that is, how to score the similarity of two large graphs in terms of the maximum possible number of matched edges. This problem is known to be NP-hard. We propose a new two-step approach that ensures both high efficiency and high quality. Our solution can be applied to both unlabeled and labeled graphs. We conducted extensive testing using real and synthetic datasets and confirmed the quality and efficiency of our approach.

## Acknowledgments

The work was supported by the Research Grants Council of the Hong Kong SAR, China (419109), ARC Discovery Grants (ARCDP0987557, ARCDP110102937, ARCDP120104168), and NSFC61021004.